Abstract


This  thesis  is  an  investigation  into  the  nature  of  decentralized  algorithms  in  distributed 
systems  of  computers.  We  consider  a  set  of  autonomous  processors,  each  with  local  storage, 
interconnected  through  bidirectional  communications  links.  The  system  is  not  presumed  to 
have a global clock, or any shared global memory, or an a priori hierarchical ordering of
processors that would designate a central controller. Instead, every processor is
equivalent  to  the  others,  and  they  communicate  with  one  another  through  message  passing. 

Algorithms  which  serve  a  global  function  in  such  a  distributed  system  depend  on  the 
cooperative  and  coordinated  behaviour  of  every  processor.  The  algorithms  which  operate  in 
this environment are decentralized in that each processor is involved in the incremental
execution of the algorithm. The initiation of an algorithm may occur at several processors, even
though  only  one  execution  of  the  algorithm  is  desired.  The  decentralized  algorithms  must  be 
designed  to  signal  termination  to  one  or  more  initiator  nodes. 

We model a distributed system as a connected graph in which nodes represent processors,
and edges represent communications links. We assume that each node has a unique
name,  and  that  messages  take  a  finite  non-zero  time  to  traverse  each  edge.  The  algorithms 
with which we are concerned are decentralized because some global function has to be performed,
or because a system-wide property of the graph is needed.

Many  of  the  algorithms  which  we  have  discovered  are  decentralized  graph  algorithms. 
They  are  based  on  a  simple  technique  of  parallel  graph  traversal,  which  gives  rise  to  what  we 
have  called  echo  algorithms.  These  are  versatile  and  adaptable  in  simple  ways  to  solve  a 
large  number  of  graph-theoretic  problems.  These  include  finding  the  largest  member  of  a  set, 
sorting the set of nodes, finding the largest member of a subset of nodes, finding the biconnected
components of a graph, finding the minimum spanning tree in a weighted graph,
finding the strongly connected components of a directed graph, detecting whether a directed
graph contains a knot, and finding the shortest path tree in a weighted graph.

In  general,  echo  algorithms  operate  in  time  proportional  to  the  diameter  of  the  graph, 
use  message  passes  proportional  to  the  number  of  edges  in  the  graph,  and  storage  at  each 
node bounded by n^2 bits, where n is the number of nodes in the graph.

We  have  addressed  specific  systems  problems  in  the  course  of  our  studies.  In  particular, 
we  have  motivated  the  need  for  a  decentralized  mutual  exclusion  mechanism  and  provided 
an  echo  algorithm  for  this  purpose.  We  have  used  this  mechanism  to  solve  the  multiple-file 
copy  update  problem  for  distributed  data  bases.  Finally,  we  have  used  a  version  of  the 
strongly  connected  component  algorithm  to  detect  whether  or  not  deadlock  is  present  in  a 
system  of  processes  and  resources  modelled  by  a  general  graph. 
Foreword


This  thesis  has  been  motivated  in  large  part  by  my  past  research  and  experience  with 
medical information systems. The complex of organizations and institutions which are involved
in the health care of individuals produces enormous amounts of data every day. Some
of  these  data  are  concerned  with  the  administration  of  facilities  and  health  personnel,  but 
the  difficult  data  in  terms  of  information  handling  are  those  directly  related  to  patient  care. 
In  a  broad  sense,  these  include  not  just  hospital  in-patient  records  and  doctors’  out-patient 
records,  but  also  data  concerning  scheduling,  pharmaceutical  prescriptions  and  dispensing, 
epidemiological  analyses,  laboratory  and  pathology  data,  intensive  care  and  physiological 
monitoring  systems.  These  data,  created  through  the  interaction  of  patients  with  the  health 
system,  are  potentially  of  important  use.  They  provide  continuity  of  patient  care,  and  the 
monitoring of patterns of illness and the use of facilities. In addition, they are vital to the
discovery of new knowledge in health research. It is therefore critical for these data to be
gathered and stored in systematic ways, to allow efficient and organized retrieval of data integrated
along several dimensions.

Some  important  characteristics  of  these  data  constrain  the  design  of  information  systems 
which  can  fulfill  the  necessary  functions.  Let  us  consider  these  briefly.  Many  autonomous 
sources  from  different  locations  create  large  amounts  of  data  concerning  the  same  entity. 
These  data  exhibit  high  locality  of  reference  in  that  information  generated  at  one  source  is 
used  subsequently  more  by  the  originator  than  by  anyone  else.  A  high  degree  of  idiosyncrasy 
exists  in  terms  of  the  perceived  information  structures  and  needs  among  different  users.  This 
is particularly acute in the patient records different doctors keep in their offices. On the other
hand, the actual processing requirements of these data are rather simple: they are entered,
deleted,  retrieved,  counted  or  displayed.  One  final  point  is  that  a  great  deal  of  security  and 
reliability  is  usually  desired  in  a  medical  information  system. 

We  have  argued  in  the  past  [CHANG  74]  for  the  use  of  distributed  data  base  systems  in 
this environment. The use of small autonomous computers connected in a network, with
each site maintaining its data locally, gives a degree of privacy and increased reliability to
the overall system that is difficult to match in large single machines. A distributed system enhances
simultaneous  access  to  local  data,  while  permitting  global  queries  to  be  answered,  though 
perhaps  at  a  higher  cost.  Yet  such  global  queries  can  be  processed  concurrently  at  many 
different  nodes. 

One  of  the  most  important  problems  still  to  be  met  before  distributed  data  bases  can  be 
used  effectively  is  that  of  decentralized  control.  The  advantages  of  a  distributed  system  are 
diminished  by  the  need  to  use  a  central  controller,  which  is  both  a  bottleneck  and  a  critical 
resource  of  the  entire  system.  Thus,  decentralized  methods  of  coordination  must  be  provided 
in order to allow the smooth interaction of users who may otherwise interfere with each other
or cause inconsistent results to occur. This is the motivation behind this thesis, which is
an  investigation  into  the  subject  of  decentralized  control.  While  the  thesis  does  not  address 
the  medical  area,  nor  specifically  the  problem  of  distributed  data  bases,  we  hope  that  it  will 
provide  insights  into  the  general  area  of  the  control  of  distributed  systems,  and  that  some  of 
its  techniques  will  find  specific  application  to  medical  information  systems. 
Chapter 1


Introduction 


1.1.  Motivation 

The rapid development of microcomputer and communications technology makes feasible
the creation of networks of processors, each with its own storage, configured in arbitrary
ways.  Such  distributed  computing  systems  might  be  used  to  share  scarce  resources,  or  data 
distributed  throughout  the  system,  or  to  achieve  parallelism  in  complex  calculations.  These 
global  objectives  can  only  be  accomplished  through  the  coordinated  behaviour  of  all 
processes. Let us make a distinction between application algorithms which solve user problems,
and control algorithms, which are used to coordinate interacting processes. In this
thesis,  we  are  concerned  with  the  latter  class  of  algorithms. 

Control algorithms in distributed systems can be centralized or decentralized. As an example
of the first, consider a network in which a central controller handles the requests for
resources from all processes, and permits an acquisition only if it would not lead to
deadlock. A decentralized algorithm would not use a central controller. Rather, each process
would be equivalent, cooperating with its neighbours to a common end. The study of
decentralized control algorithms forms the central subject of this thesis. In the following sections
of this chapter, I will elaborate on the definition of this problem.

1.2. Centralized Algorithms

The disadvantages of centralized control in an environment supporting simultaneous activity
are of much concern. In general, the use of a global resource for interprocess coordination
leads to three problems. The first is that such a resource becomes critical to the network.
If it fails, the entire network fails. The second is that a critical resource is a
bottleneck  designed  into  the  system:  all  processes  requiring  coordination  run  at  the  rate  of 
activity supported by the critical resource. The third is that the use of a global resource imposes
a sequential constraint in an environment which is intended to support parallel activity.
Decentralized algorithms are alternative methods of control which may avoid many of
these  problems. 

1.3. Decentralized Algorithms

A  great  deal  of  previous  work  on  the  coordination  of  parallel  processes  is  based  on  a 
critical  assumption:  that  processes  share  a  common  memory,  one  which  has  an  arbiter  at  the 
elemental  level,  such  that  an  indivisible  action  exists  at  each  cell.  Call  this  the  Common 
Memory  Assumption. 

The  majority  of  synchronization  mechanisms  in  the  literature  make  this  assumption. 
Thus,  semaphores  [DIJK  68],  critical  regions  [BRIN  72],  and  monitors  [HOAR  74]  all  use 
indivisible  operations  on  shared  objects  to  control  access  to  privileged  operations.  In  single 
processor pseudo-concurrent environments, many of these mechanisms work very well. However,
in the distributed environment, it is not desirable to depend on common memory,
which  usually  does  not  exist. 

Many models of parallel computation also assume that memory is shared. Thus, multi-processor
architectures such as C.mmp [WULF 74] base their communications on shared
memory. Furthermore, many studies of the complexity of parallel computation also use the
model of k-processors [ARJO 78, ECKS 77] operating on shared memory. The multi-processor
environment in which we are interested is quite different in this respect.

There is a spectrum of centralized control methods which are based on the use of a global
object. We say that a multi-processor system exhibits strong central control if all
processes  are  scheduled  or  permitted  to  run  by  some  special  control  process,  or  there  is  a 
predetermined  hierarchy  of  processes.  On  the  other  hand,  a  multi-processor  system  exhibits 
weak  central  control  if  some  global  object  is  used  for  control  purposes,  such  as  shared 
memory.  This  can  be  used  to  implement  queues,  semaphores,  critical  regions  or  monitors. 

We  are  interested  in  control  algorithms  for  multi-processors  which  use  no  central  control 
at  all.  Message  passing  would  be  the  only  mechanism  of  communication,  and  of  providing  a 
shared  reference  point.  We  only  assume  a  communications  link,  local  storage  at  each  node 
and  unique  names  for  each  node.  It  may  be  argued  that  local  storage  with  an  arbiter 
mechanism  is  a  form  of  distributed  shared  memory.  However,  this  sharing  is  local  among 
immediate  neighbours,  rather  than  global. 

1.4.  Description  of  the  Environment 

We  model  a  distributed  computer  system  as  a  connected  and  undirected  graph  in  which 
each node is a processor and each edge a bidirectional communications link. Let each processor
have its own local storage, and assume that it is capable of supporting multiple local
processes, so that while some application algorithms are executing, it is also capable of sending
and receiving messages, and of executing control algorithms. We assume a message
passing  capability  which  can  send  in  parallel  to  the  immediate  neighbours  of  a  node,  and  has 
enough  memory  at  each  node  to  store  all  incoming  messages.  We  do  not  presume  any  fixed 
speed  for  transmission  between  nodes,  but  we  do  not  allow  one  message  to  overtake  another 
on  a  link.  Thus,  messages  from  one  node  to  another  on  a  single  communications  link  arrive 
in  the  order  in  which  they  are  sent.  Furthermore,  we  will  assume  that  message  passing 
between two ends of a link uses a protocol requiring positive acknowledgment. Thus, either
the  message  has  been  sent  successfully,  or  the  sender  knows  it  has  not,  and  retries.  Using 
this  scheme,  any  permanent  message  loss  occurs  not  in  transmission,  but  in  association  with 
node  failures. 
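
The link assumptions above (messages on a link arrive in the order sent, and every transmission is either acknowledged or retried) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class `Link`, its `loss_rate` parameter, and the stop-and-wait retry loop are hypothetical and are not taken from the thesis.

```python
import random
from collections import deque


class Link:
    """One direction of a communications link (hypothetical helper).

    A transmission attempt may fail; the sender learns of the failure
    because no acknowledgement arrives, and it retries.
    """

    def __init__(self, loss_rate, seed=0):
        self.loss_rate = loss_rate        # assumed to be strictly below 1
        self.rng = random.Random(seed)
        self.inbox = deque()              # receiver storage, in arrival order
        self.attempts = 0                 # total transmission attempts made

    def _transmit(self, message):
        """Try once; return True only if the receiver acknowledged."""
        self.attempts += 1
        if self.rng.random() < self.loss_rate:
            return False                  # no acknowledgement received
        self.inbox.append(message)
        return True

    def send(self, message):
        """Retry until acknowledged, so later messages cannot overtake."""
        while not self._transmit(message):
            pass

    def receive(self):
        return self.inbox.popleft()


link = Link(loss_rate=0.5, seed=1)
for m in ["a", "b", "c", "d"]:
    link.send(m)
received = [link.receive() for _ in range(4)]
```

Because a sender does not start the next message until the previous one is acknowledged, order is preserved on the link even when individual attempts fail.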

We  define  in  more  detail  two  models  of  distributed  environments:  a  circular 
configuration,  and  a  configuration  modelled  by  an  arbitrary  connected  graph. 

1.4.1.  Model  1.  Circular  Configuration 

1.  Assume  a  number  of  independent  processors,  each  of  which  supports  multi-processing, 
and  has  local  storage. 

2. There exists a unique identifier for each node. The set of node names has a total ordering.

3.  Each  node  only  knows  its  own  identity  and  that  of  its  immediate  neighbours. 

4. Arrange them in a circle, joined by communications links, such that each node has two
neighbours. 

5. Let the message passing facility carry messages in one consistent direction on the communications
links between nodes, say clockwise.

1.4.2. Model 2. Arbitrary Configuration

1-3. Same as Model 1.

4.  Let  the  nodes  be  connected  in  an  arbitrary  network,  by  communications  links. 

5.  Let  the  message  passing  facility  on  each  link  be  bidirectional. 

This  model  is  easily  modified  to  support  the  study  of  networks  modelled  by  directed 
graphs,  or  networks  in  which  edges  between  nodes  have  weights.  In  later  portions  of  this 
thesis,  we  will  construct  algorithms  for  these  environments.  Note  that  although  the  model 
introduced  uses  a  network  as  an  abstraction,  by  no  means  are  we  specifically  addressing  the 
problems  only  of  geographically  large  networks.  First  of  all,  the  model  supports  any  loosely 
coupled  multiprocessor  system  with  autonomous  nodes  which  communicate  only  through 
message passing. Secondly, it is useful as a vehicle for understanding the nature of decentralized
algorithms, within the context of a variety of different graph-theoretic problems, many
of  which  also  have  practical  application. 
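
As an illustration, the two models can be represented by very small data structures. The sketch below is an assumption-laden example rather than anything prescribed by the thesis: the function names `ring`, `arbitrary` and `is_connected` are hypothetical. Model 1 is stored as a successor map (messages travel in one direction only), and Model 2 as an undirected neighbour map.

```python
def ring(names):
    """Model 1: map each node to its clockwise successor."""
    ordered = list(names)
    return {v: ordered[(i + 1) % len(ordered)] for i, v in enumerate(ordered)}


def arbitrary(edges):
    """Model 2: undirected graph; each node records only its neighbours."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return neighbours


def is_connected(neighbours):
    """Check the connectivity assumption with a depth-first search."""
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        for w in neighbours[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(neighbours)


successor = ring([7, 2, 5])
graph = arbitrary([(1, 2), (2, 3), (3, 1), (3, 4)])
```

Note that `is_connected` is a global check made by an outside observer; a node inside either model only ever sees its own entry in the map.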

1.5. Three Fundamental Issues

Decentralized algorithms are organized around the behaviour of a number of autonomous
processors. It is this autonomy that leads to three basic issues which are unique to
this class of algorithms. The first is that of starting, the second concerns relative speeds, and
the third issue addresses the names of nodes. A decentralized algorithm can be characterized,
in part, by two numbers. The first is the Participation Number, and the second its
Source  Number.  The  Participation  Number,  P#,  is  the  number  of  processes  which  are 
allowed  to  participate  in  the  algorithm.  In  general,  this  number  will  be  n,  the  number  of 
processes  in  the  system.  However,  as  we  shall  see  in  Chapter  3,  some  algorithms  only 
operate on a subset of processes. The second number, the Source Number, or S#, is the
number  of  processes  which  are  allowed  to  initiate  one  invocation  of  the  algorithm.  If  the  S# 
of  an  algorithm  is  1,  then  several  nodes  which  independently  start  an  algorithm  will  invoke 
several distinct executions of the algorithm. On the other hand, if the S# is > 1, then the
algorithm is designed to expect several independent processes to try to initiate a single execution
of the algorithm. The first we call Single Source, and the second Multi-Source. In
general,  the  S#  for  an  algorithm  is  either  1  or  n. 

The  next  issue  is  that  of  speed.  Rosenstiehl’s  [ROSE  72]  model  assumes  that  all  finite- 
state  machines  behave  synchronously.  Clearly  our  model  is  one  in  which  different  processors 
may  have  different  speeds,  and  communications  speeds  may  also  vary.  Yet  if  speed 
differences  are  too  large,  it  becomes  difficult  to  make  any  meaningful  statements  about  the 
expected  execution  time  of  the  algorithm.  In  fact,  if  some  part  of  the  system  is  very  slow,  it 
can  cause  the  entire  distributed  system  to  run  at  that  speed,  and  make  the  system  as  a  whole 
unacceptable.  In  general,  we  will  assume  that  all  processors  have  the  same  speed. 

The  third  issue  is  concerned  with  the  relationship  between  the  names  of  nodes  and  their 
relative  rank  or  priority.  A  name  given  to  a  node  in  a  network  is  invariant  with  time,  and 
serves  to  uniquely  identify  that  node  to  the  rest  of  the  nodes  in  the  network.  The  rank  or 
priority of a node, on the other hand, may change with time. For example, it may be associated
with a particular time at which some activity started at the node, so that earlier nodes
have  a  higher  priority.  It  is  important  to  realize  that  the  attribute  of  priority  for  a  node  may 
be  taken  from  different  domains  for  different  purposes.  Thus,  it  is  sometimes  convenient  to 
take  the  name  of  a  node  as  its  rank,  for  example,  in  choosing  a  single  node  from  among 
many  to  proceed.  In  order  for  this  to  be  possible,  a  total  ordering  on  the  set  of  names  must 
be  assumed.  For  simplicity,  we  assume  in  the  rest  of  this  thesis,  unless  specifically  stated 
otherwise,  that  the  names  of  nodes  are  used  for  their  priority  values. 

1.6. The Study of Decentralized Algorithms

The  study  of  decentralized  control  algorithms  will  be  approached  in  the  following  way: 
we  will  motivate  each  algorithm,  present  it,  make  arguments  for  its  correctness,  consider  its 
behaviour  in  the  face  of  failures,  and  study  its  efficiency  in  terms  of  three  metrics.  These  are: 
total  number  of  messages,  total  number  of  bits  of  storage  needed  at  each  node,  and  elapsed 
communication  time.  This  last  measure  requires  some  clarification. 

The  total  time  elapsed  in  executing  a  decentralized  algorithm  has  two  components.  The 
first  is  processor  time,  and  the  second  is  communication  time.  In  order  to  study  elapsed 
time,  we  will  make  several  assumptions  in  our  model  with  respect  to  these  times.  First  of  all, 
since we have an asynchronous system, we need not have any guarantees as to how fast messages
move, except that all messages take a bounded time to travel a link. However, for
purposes  of  analysis  of  elapsed  time,  we  will  consider  the  average  case,  that  messages  take 
approximately  the  same  time  to  traverse  an  edge. 

Secondly,  we  assume  that  processor  time  required  to  process  a  message  is  very  small 
compared  to  communication  time.  We  do  this  for  two  reasons:  we  do  not  a  priori  know 
what  power  each  processor  may  have,  and  we  are  interested  in  studying  the  effect  that 
decentralization has on the algorithm. Thus, we are really interested in studying the communication
time of an algorithm.

Thirdly, there is the question of queueing. A node may receive many messages simultaneously,
and some messages may not be serviced immediately by the processor. We
include queueing costs in communication costs, and assume that a unit cost of communication
includes an average queueing delay.

The  effect  of  these  assumptions  is  to  allow  us  to  speak  of  elapsed  communications  times 
in  terms  of  number  of  edges  traversed.  This  gives  us  some  feeling  for  the  algorithms  which 
operate  in  general  in  an  asynchronous  system,  in  which  the  distribution  of  edge  traversal 
times  may  in  fact  be  quite  arbitrary.  The  study  of  total  elapsed  time,  which  will  include 
processor  and  queueing  times,  is  of  course  very  important.  However,  the  analytical  tools 
required  to  do  this  in  a  rigorous  way  involve  primarily  congestion  theory  and  queueing 
theory,  which  differ  greatly  from  the  basically  combinatorial  arguments  we  employ  in  this 
thesis.  The  analyses  which  we  undertake  are  in  an  attempt  to  understand  the  structure  and 
complexity  of  decentralized  algorithms,  and  mixing  the  two  approaches  may  well  conceal  the 
intuitive  understanding  of  our  results.  We  feel  that  the  two  points  of  view  are  separable, 
complementary, and independent. The extension into the performance of decentralized algorithms
using stochastic methods will be a most appropriate area for further research.

Finally,  we  make  a  few  comments  on  notation  and  description,  which  apply  to  the  rest 
of the thesis. First, whenever we write log n, we will always mean log_2 n. Next, the algorithms
are presented in the form of numbered paragraphs. They are to be considered sequentially,
so that if paragraph 1 does not apply, consider paragraph 2, etc. After an applicable
paragraph  has  been  encountered,  the  execution  ends  after  the  instructions  in  the  paragraph 
have  been  followed,  unless  otherwise  specified. 
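
The reading convention for numbered paragraphs amounts to "first applicable paragraph wins". A minimal sketch of that convention follows; the function `run_paragraphs` and the toy two-paragraph algorithm are hypothetical illustrations, not algorithms from the thesis.

```python
def run_paragraphs(paragraphs, state, message):
    """Apply the first applicable paragraph, then stop.

    `paragraphs` is an ordered list of (applies, action) pairs.
    Returns the number of the paragraph that fired, or None.
    """
    for number, (applies, action) in enumerate(paragraphs, start=1):
        if applies(state, message):
            action(state, message)
            return number
    return None


# Toy algorithm: paragraph 1 records a new maximum, paragraph 2 is the fallback.
paragraphs = [
    (lambda s, m: m > s["largest"], lambda s, m: s.update(largest=m)),
    (lambda s, m: True, lambda s, m: s.update(ignored=s["ignored"] + 1)),
]
state = {"largest": 3, "ignored": 0}
fired = [run_paragraphs(paragraphs, state, m) for m in (5, 2, 9)]
```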

1.7.  Structure  of  the  Thesis 

The  rest  of  the  thesis  has  the  following  structure.  In  Chapter  2,  we  review  the  literature 
in  the  area  of  decentralized  control  algorithms,  which  provides  the  background  to  the  subject 
of  this  thesis.  In  Chapter  3,  we  introduce  the  n-philosophers  problem  within  the  context  of  a 
distributed  system.  A  decentralized  algorithm  for  solving  this  problem  is  given,  one  which 
detects  deadlock  and  avoids  starvation.  The  avoidance  of  starvation  requires  a  mutual 
exclusion  mechanism  which  will  allow  only  one  philosopher  to  issue  a  relinquish  message 
when  deadlock  is  present.  We  introduce  several  variants  of  a  decentralized  mutual  exclusion 
mechanism,  and  show  their  application  to  the  multiple-copy  file  update  problem. 

Chapter  4  presents  a  parallel  method  for  the  traversal  of  a  general  connected  graph.  This 
leads to a class of decentralized algorithms which we call echo algorithms, which take O(e)
messages, and O(D) communication time, where e is the number of edges in the graph, and
D is the diameter of the graph. Some simple echo algorithms will be presented in this chapter.
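
To give a feeling for the O(e) message bound before Chapter 4, the sketch below simulates the echo scheme as it is commonly described: the initiator sends to all its neighbours, every other node adopts the sender of its first message as its parent and forwards to its remaining neighbours, and a node replies to its parent once it has heard from every neighbour. This is an assumed, generic formulation and may differ in detail from the algorithms of Chapter 4; the function `echo` and the single global message queue are simulation devices only.

```python
from collections import deque


def echo(neighbours, initiator):
    """Simulate one echo wave; return (parent map, messages sent, terminated)."""
    parent = {initiator: None}
    received = {v: 0 for v in neighbours}   # messages received so far
    queue = deque()                          # in-flight (sender, receiver) pairs
    sent = 0

    def send(u, v):
        nonlocal sent
        sent += 1
        queue.append((u, v))

    for v in neighbours[initiator]:
        send(initiator, v)

    terminated = False
    while queue:
        u, v = queue.popleft()
        received[v] += 1
        if v not in parent:
            parent[v] = u                    # first message fixes the parent
            for w in neighbours[v]:
                if w != u:
                    send(v, w)
        if received[v] == len(neighbours[v]):
            if v == initiator:
                terminated = True            # all echoes are in
            else:
                send(v, parent[v])           # echo back towards the initiator
    return parent, sent, terminated


# A 4-cycle with one chord: 4 nodes, 5 edges.
graph = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}
parent, sent, terminated = echo(graph, initiator=1)
```

In this formulation each edge carries exactly one message in each direction, so the example sends 2e = 10 messages, and the parent pointers form a spanning tree rooted at the initiator.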


Chapter 5 describes an algorithm which identifies the biconnected components of a connected
graph, which are those subgraphs which have no internal articulation points. An articulation
point a of a graph is a node for which there exist nodes v and w, distinct from a, such that all paths
between v and w contain a. Biconnected components are joined to each other by articulation
points, and it is important to be able to identify these nodes of potential disjunction in a
distributed  system.  In  addition,  we  present  a  decentralized  version  of  Kruskal’s  [KRUS  56] 
sequential  algorithm  for  finding  the  minimum  spanning  tree  of  a  weighted  graph. 
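
The definition of an articulation point can be checked directly, if inefficiently, by deleting each node in turn and testing whether the rest of the graph stays connected. The sketch below is a sequential, centralized check intended only to make the definition concrete; it is not the decentralized algorithm of Chapter 5, and the function names are hypothetical.

```python
def reachable(neighbours, start, removed):
    """Nodes reachable from `start` when node `removed` is deleted."""
    seen, stack = {start}, [start]
    while stack:
        for w in neighbours[stack.pop()]:
            if w != removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def articulation_points(neighbours):
    """Nodes whose removal disconnects some pair of remaining nodes."""
    points = set()
    for a in neighbours:
        others = [v for v in neighbours if v != a]
        if others and len(reachable(neighbours, others[0], a)) < len(others):
            points.add(a)
    return points


# Two triangles, {1, 2, 3} and {3, 4, 5}, sharing node 3.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4, 5], 4: [3, 5], 5: [3, 4]}
points = articulation_points(graph)
```

Here node 3 is the only articulation point: every path from node 1 to node 4 passes through it, and the two triangles are the biconnected components it joins.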

Chapter  6  deals  with  algorithms  on  directed  graphs.  An  echo  algorithm  for  finding  the 
strongly  connected  components  of  a  directed  graph  is  given.  A  special  case  of  this  algorithm 
is  the  detection  of  a  knot,  a  graph  configuration  in  which  the  set  of  nodes  reachable  via 
directed  paths  from  every  node  is  the  entire  set  of  nodes  of  the  graph.  We  then  show  how 
the  knot  detection  algorithm  may  be  used  in  a  distributed  system  of  processes  and  resources 
modelled  by  a  directed  graph  to  detect  the  presence  of  deadlock. 
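
The knot condition as stated above (from every node, the set of nodes reachable along directed paths is the whole node set) can be tested by a plain reachability search. The following is a sequential sketch of the definition only, with hypothetical function names, and is not the echo algorithm of Chapter 6.

```python
def reachable_from(successors, start):
    """Nodes reachable from `start` by directed paths of length at least 1."""
    seen, stack = set(), [start]
    while stack:
        for w in successors[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def is_knot(successors):
    """True if every node reaches exactly the entire node set."""
    nodes = set(successors)
    return all(reachable_from(successors, v) == nodes for v in nodes)


cycle = {1: [2], 2: [3], 3: [1]}
with_tail = {1: [2], 2: [3], 3: [1], 4: [1]}   # node 4 cannot be reached
```

With these inputs, `is_knot(cycle)` holds, while `is_knot(with_tail)` does not, because no directed path leads back to node 4.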

Chapter  7  sketches  a  particular  shortest  path  algorithm  which  finds  the  shortest  path  tree 
rooted  at  a  specific  node.  The  analysis  of  the  worst  case  is  bounded  by  the  consideration  of 
a  fully  connected  graph,  and  a  version  using  a  decentralized  clock  is  presented,  which 
requires only O(n^3) total messages. This method is then applied to the general graph problem.

Chapter  8  will  be  a  concluding  chapter,  giving  a  discussion  of  the  results  of  the  thesis, 
and  suggesting  areas  for  further  research. 
Chapter 2


Background 


2.1.  Introduction 

In  this  chapter,  we  will  review  the  past  work  in  decentralized  control  which  is  either 
relevant  or  contributory  to  the  ideas  put  forward  in  this  thesis. 

2.2.  Dijkstra’s  Early  Paper 

One  of  the  first  relevant  papers  to  appear  in  the  literature  is  a  note  by  Dijkstra  [DIJK 
74] concerning the existence of algorithms which use distributed control but are self-stabilizing.

Given  a  connected  graph  of  N  finite  state  machines,  define  a  privilege  of  a  machine  as  a 
boolean function of its state and the state of its neighbours. If such a function is true, we
say  that  the  privilege  is  present.  Assume  an  arbitrary  order  on  the  moves  of  the  machines, 
where  a  move  by  a  machine  takes  a  machine  with  a  true  privilege  into  another  state. 

The global criteria for a legitimate system are as follows: for each legitimate state, a
privilege must be present, and each possible move should take the system into another legitimate
state. Furthermore, each privilege must be present in at least one legitimate state,
and for any pair of legitimate states, there must exist a sequence of moves from one to the
other.

Finally, the system is self-stabilizing if and only if the system is guaranteed to reach a
legitimate state after a finite number of moves, regardless of the initial state and the sequence of moves. Dijkstra showed
the  existence  of  non-trivial  self-stabilizing  systems  for  the  three  cases  of  K-state  Machines 
(K>N),  Four-state  and  Three-state  Machines. 

These  systems  require  no  central  store,  and  no  global  controller.  Each  machine  behaves 
according  to  its  local  rules.  Unfortunately,  it  was  found  necessary  to  designate  TOP  and 
BOTTOM  machines,  which  means  that  not  all  processes  are  equivalent. 
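
The K-state case can be simulated in a few lines. The sketch below follows the rules as they are usually presented for Dijkstra's K-state ring: one distinguished machine (machine 0 here) is privileged when its state equals that of its neighbour and moves by incrementing modulo K, while every other machine is privileged when its state differs from its neighbour's and moves by copying it. The function names, the random scheduler and the fixed seed are assumptions of this example.

```python
import random


def privileged(states):
    """Indices of the machines whose privilege is present."""
    result = []
    for i in range(len(states)):
        left = states[i - 1]              # machine 0 looks at the last machine
        if i == 0:
            if states[0] == left:
                result.append(0)
        elif states[i] != left:
            result.append(i)
    return result


def move(states, i, k):
    """Let privileged machine i make its move."""
    if i == 0:
        states[0] = (states[0] + 1) % k   # distinguished machine increments
    else:
        states[i] = states[i - 1]         # the others copy their neighbour


def run(states, k, steps, seed=0):
    """Repeatedly pick an arbitrary privileged machine and move it."""
    rng = random.Random(seed)
    for _ in range(steps):
        move(states, rng.choice(privileged(states)), k)
    return states


states = run([3, 1, 4, 1, 5], k=6, steps=500)   # N = 5 machines, K > N
```

At least one privilege is always present, so the scheduler never runs out of choices. After enough moves the ring holds exactly one privilege, which then circulates; the need for a distinguished machine 0 is the asymmetry noted above.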

2.3.  Lamport’s  Work 

There  are  two  significant  contributions  to  the  study  of  distributed  computer  systems  by 
Lamport.  The  first  [LAMP  77]  deals  with  concurrent  reading  and  writing  in  multiprocessor 
systems, using an approach which requires no mutual exclusion. Each reader or writer simply
executes its function. The essence of the idea is that a resource has two variables v1 and
v2, which a writer updates before and after writing in the order v1, v2. A reader reads them
in the order v2, v1, before and after reading. If v2 = v1 for the reader then it has read a
single  consistent  copy  of  the  resource.  This  clever  approach  depends,  unfortunately,  on  the 
use  of  global  shared  memory. 
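
The two-variable idea can be made concrete with a single-threaded simulation in which reads and writes are broken into steps and interleaved by hand. This is a sketch of the scheme as summarized above, not of the full protocol in [LAMP 77]; the class `Resource`, the use of counters for v1 and v2, and the generator-based interleaving are assumptions of the example.

```python
class Resource:
    def __init__(self, size):
        self.v1 = 0                      # incremented before a write
        self.v2 = 0                      # set equal to v1 after a write
        self.cells = [0] * size

    def write_steps(self, value):
        """Generator: one yield per step of a write."""
        self.v1 += 1                     # a write is starting
        yield
        for i in range(len(self.cells)):
            self.cells[i] = value
            yield
        self.v2 = self.v1                # the write is complete
        yield

    def read_steps(self, out):
        """Generator: copy the cells, then append (snapshot, consistent)."""
        before = self.v2                 # read v2 first
        yield
        snapshot = []
        for i in range(len(self.cells)):
            snapshot.append(self.cells[i])
            yield
        after = self.v1                  # read v1 last
        out.append((snapshot, before == after))
        yield


res, out = Resource(3), []
reader = res.read_steps(out)
next(reader)                             # reader has read v2
next(reader)                             # reader has copied cell 0
for _ in res.write_steps(7):             # a complete write intervenes
    pass
for _ in reader:                         # reader finishes: torn snapshot
    pass
for _ in res.read_steps(out):            # an undisturbed second read
    pass
```

The first read returns a mixed snapshot and detects it because v2 (read first) and v1 (read last) differ; the second read finds them equal and can accept its copy.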

The  second  aspect  of  Lamport’s  research  [LAMP  78]  deals  with  the  ordering  of  events 
in  a  distributed  system,  without  using  a  global  clock.  Assume  the  existence  of  local  clocks 
which  are  monotonically  increasing  though  not  necessarily  accurate.  Clocks  can  be  relatively 
synchronized  by  time-stamping  a  message  originating  from  each  node,  and  setting  the  clock 
at  a  receiving  node  to  be  equal  to  or  greater  than  the  time-stamp  on  an  arriving  message. 
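
The clock rule can be sketched as follows. The class `LamportClock` and its method names are hypothetical; on receipt the sketch advances the clock strictly past the incoming time-stamp, which is one way of satisfying the rule just described.

```python
class LamportClock:
    """A local logical clock that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Return the time-stamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, timestamp):
        """Move the clock past the time-stamp of an arriving message."""
        self.time = max(self.time, timestamp) + 1
        return self.time


a, b = LamportClock(), LamportClock()
for _ in range(3):
    a.tick()                      # three local events at node a
stamp = a.send()                  # a's fourth event is a send
arrival = b.receive(stamp)        # b has had no events of its own yet
```

Although node b's clock was behind, the receipt is stamped later than the send, so the order of the two events is preserved without any global clock.
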
Using this system of clocks, Lamport stated that it is possible to control exclusive access
to a resource for simultaneous requestors, by giving the resource to a process whose own request
is earlier than that of all other competing requests in its request-queue, given that all
other  processes  have  been  heard  from.  Note  that  there  is  only  one  node  which  will  decide  to 
go ahead and acquire the resource, based on local information acquired system-wide by messages.
Each contending node must send a request to everyone, and receives a message back
from all nodes. We will later show techniques which only require n log n messages instead of
n^2 messages to accomplish the same task.

2.4.  Multiple  Copy  Update  Problem 

In  a  distributed  data  base  implemented  on  a  network  of  processors,  each  having  a  copy 
of the same file, it is a non-trivial problem to keep the multiple copies consistent in the
face  of  simultaneous  updates  coming  from  different  nodes.  This  problem  has  motivated 
much study, and we will present here several of the solutions which use decentralized control.

Thomas  [THOM  75]  devised  a  decentralized  method  of  solving  the  multiple-copy  update 
problem  using  three  strategies:  time-stamps,  majority  voting,  and  a  priority  assigned  to  each 
node. An update request consists of the identity and priority of the requestor, a list of variables
to be updated, a list of base variables on which the update is based, the time-stamps for
the  base  variables,  and  the  time-stamp  for  the  update  request,  which  is  the  maximum  of  the 
local  time  and  the  most  recent  time  on  a  base  variable. 

An update request is circulated. At each node, the Data Base Management Process
(DBMP) votes on the request. It votes NO if any base variable has a time-stamp which is
obsolete, i.e., an update has been approved, but the requestor did not know about it. It votes
OK if the base variables are current and the request does not conflict with any pending
request at the DBMP, where two updates conflict if they affect some common base variable.
A DBMP voting OK records the update as pending there, and sends the request on to the next
node. It votes Deadlock Reject (DR) if base variable time-stamps are current but a conflict
exists with a pending update request of a higher priority. If enough DR votes accumulate to
prevent a majority OK vote, then the update can be rejected. Finally, a DBMP may defer
voting if either the base variables are more current than those of the data base at the
DBMP, or else a conflict exists with a pending update of lower priority.
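
The voting rule at a single DBMP can be summarized in a few lines. The sketch below is our own reading of the four cases above, not Thomas's code: UpdateRequest, vote, local_stamps and pending are hypothetical names, priorities are assumed to be distinct, and a conflict is taken to mean that two requests share a base variable.

```python
from dataclasses import dataclass


@dataclass
class UpdateRequest:
    requestor: int
    priority: int
    base_stamps: dict  # base variable name -> time-stamp seen by the requestor


def vote(request, local_stamps, pending):
    """Return "NO", "OK", "DR" or "DEFER" for one DBMP.

    local_stamps: variable name -> time-stamp of the local copy.
    pending:      requests this DBMP has already voted OK on.
    """
    base = request.base_stamps
    # Obsolete base variable: an approved update was unknown to the requestor.
    if any(stamp < local_stamps[var] for var, stamp in base.items()):
        return "NO"
    # The requestor is ahead of this copy: wait until the copy catches up.
    if any(stamp > local_stamps[var] for var, stamp in base.items()):
        return "DEFER"
    # Assumption: two requests conflict when they share a base variable.
    conflicts = [p for p in pending if set(p.base_stamps) & set(base)]
    if any(p.priority > request.priority for p in conflicts):
        return "DR"
    if conflicts:
        return "DEFER"
    return "OK"
```

A request is accepted once a majority of the DBMPs have returned OK; the tally itself travels with the circulating request and is not modelled here.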

If  an  update  is  rejected,  the  REJECTION  must  be  sent  from  the  DBMP  to  all  others,  so 
they  may  process  their  pending  requests.  If  an  update  is  accepted,  i.e.,  the  DBMP’s  OK  vote 
produces  a  majority,  then  the  DBMP  must  update  its  copy,  and  send  an  update  to  all  other 
copies.  It  is  easy  to  see  that  no  two  conflicting  updates  can  receive  majority  OK  in  the  same 
election.  Thomas  extends  this  work  to  consideration  of  failure  of  nodes,  and  shows  that  the 
method  can  be  robust. 

Essentially,  this  is  a  decentralized  method  of  checking  that  conflicting  requests  cannot 
simultaneously  be  allowed  to  update.  The  observation  of  majority  agreement  is  a  nice 
simplification  but  causes  three  problems  in  the  general  case.  First  of  all,  all  nodes  must 
know  the  extent  of  the  network.  Secondly,  all  DBMPs,  even  those  not  having  originated  an 
update  request,  must  be  involved  in  the  voting,  and  thirdly,  the  method  requires  a  sequential 
traversal  of  all  the  nodes  of  the  network,  even  if  the  network  is  configured  as  an  arbitrary 
graph. This can be needlessly expensive, if parallel traversal is possible.

Although Rosenkrantz’s [ROSE 77] model deals explicitly with the coordination of
concurrent transactions which operate on different parts of a database physically distributed
in a network, the techniques proposed are directly applicable to the multiple copy problem.
The latter can be considered a special case of the former, in which each copy is a different
part of the entire database. In the rest of this thesis, we will refer to Rosenkrantz’s
algorithms within this framework. Assume that an update process initiated at a node gets a
unique number consisting of the time of initiation followed by the node number. Thus, older
processes have smaller numbers. A process moves from node to node, making temporary
updates. When these have all been done, it may initiate TERMINATION, going around the
nodes and making the updates permanent. If the process ABORTS, it must undo all
temporary updates. A request P is said to be in conflict with a process Q if P is a read
request and Q is a granted write request which has not yet terminated, or if P is a write
request and Q is a granted read or write request which has not yet terminated. The conflict
between P and Q is then resolved according to their process numbers, as follows.

In  the  DIE-WAIT  system,  the  request  P  will  wait  if  it  is  older,  else  it  DIES,  i.e.,  aborts 
and  restarts.  In  the  WOUND-WAIT  system,  if  P  is  older,  it  wounds  Q,  i.e.,  causes  Q  to 
abort  if  Q  has  not  started  termination  (else  Q  completes).  If  P  is  younger,  it  waits.  Both 
these  methods  depend  on  using  an  ordering  of  process  names  to  give  precedence  to  one  of  a 
number  of  conflicting  updates.  Updates  are  allowed  to  run  to  conflict,  then  rolled  back  if 
necessary.  However,  all  updates  are  temporary,  and  a  second  visit  to  all  nodes  is  needed  to 
finalize the update. A method which will allow only one process to update, however,
completely avoids the need to roll back. We will later introduce such a method of update.
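
The two conflict rules reduce to a comparison of process numbers. The following sketch assumes that a process number is a (time of initiation, node number) pair, so that the smaller pair belongs to the older process; the function names and the q_terminating flag are hypothetical.

```python
def older(p, q):
    """True when process number p is older than q (smaller pair = older)."""
    return p < q


def die_wait(p, q):
    """First rule: requester P waits if it is older than holder Q, else it dies."""
    return "WAIT" if older(p, q) else "DIE"


def wound_wait(p, q, q_terminating=False):
    """Second rule: an older requester P wounds holder Q, a younger one waits.

    A wound has no effect once Q has started termination; P then simply
    waits for Q to complete.
    """
    if older(p, q):
        return "WAIT" if q_terminating else "WOUND"
    return "WAIT"
```

For example, with P = (1, 2) and Q = (3, 1), P is older: under the first rule P waits, and under the second rule Q is wounded unless it is already terminating.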

One solution which Ellis [ELLI 77a] proposed is similar in nature. An update is
allowed to proceed if it receives OK from all other nodes, based on a notion of priority
formed from the triple: eventcount, attempt number, node number. Two conflicting update
requests would be resolved by allowing the higher priority one to go first. An eventcount,
devised by Kanodia and Reed [REED 79], is a non-decreasing global counter associated with
the number of events which have occurred in a particular class. It uses two primitives,
advance and read. Assuming that single site atomic advance and read can be implemented,
a distributed eventcount can be constructed by having a distributed read return the sum of
the component eventcounts at all sites, and a distributed advance simply increase the
component eventcount of a home site.
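
The construction of a distributed eventcount from per-site components can be illustrated as below. The class and method names are hypothetical, and the sketch keeps all components in one object; in a real network the read would collect the component values by messages and would not see them at a single instant.

```python
class DistributedEventcount:
    """One component counter per site; the eventcount is their sum."""

    def __init__(self, sites):
        self.components = {site: 0 for site in sites}

    def advance(self, home_site):
        # A distributed advance only increases the component at the home site.
        self.components[home_site] += 1

    def read(self):
        # A distributed read sums the component eventcounts of all sites.
        return sum(self.components.values())
```

Since each component never decreases, the sum returned by successive reads never decreases either, which is the property the priority triple relies on.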

A different algorithm suggested later by Ellis [ELLI 77b] is of greater interest. Given a
circle of n processors, numbered from 1 to n, let them be arranged in increasing order such
that messages only go from a lower numbered node to its next higher node, modulo n. Two
phases are needed to perform an update. In the first, a request message originates from one
or more nodes around the circle. When a node gets its request message back, it initiates an
update phase, in which an update is propagated around the circle. Conflicting concurrent
requests are handled by a lower priority update being temporarily stored at the higher
priority node. After a node completes its update, it allows a stored request to continue. The
last node which updates has no stored request, and therefore sends a final update. After each
node performs a final update, it returns to an original idle state, allowing another set of
updates to be synchronized.

Gelenbe  and  Sevcik  [GELE  78]  suggested  that  an  update  originating  at  a  node  should  be 
immediately broadcast to all other nodes. They then investigated various policies for
deciding when an update arriving at a node should be applied. If it is applied immediately,
then roll back of some conflicting updates may be necessary. If a finite delay is applied, this
decreases the number of roll backs, but increases the time required for updates. An
interesting alternative is for a node to be sure that no earlier updates have been missed,
before applying an update. This can be done by using time stamps and sequence numbers,
such that a node will apply an update only after it has received, from all other nodes, all
messages up to a message later than that of the update.
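
The last policy can be sketched as a small replica that holds back each update until it is safe. This is our own illustration rather than the authors' formulation: Replica, receive and the other names are hypothetical, channels are assumed to deliver each sender's messages in order (the role played by the sequence numbers), a message without an update stands for any later traffic from a node, and the originator of an update is exempted from the "later message" test.

```python
import heapq


class Replica:
    """Applies updates in time-stamp order once no earlier one can be missing."""

    def __init__(self, others):
        self.latest = {node: 0 for node in others}  # latest stamp heard per node
        self.pending = []                           # heap of (stamp, origin, update)
        self.applied = []

    def receive(self, origin, stamp, update=None):
        self.latest[origin] = max(self.latest[origin], stamp)
        if update is not None:
            heapq.heappush(self.pending, (stamp, origin, update))
        self._apply_ready()

    def _apply_ready(self):
        while self.pending:
            stamp, origin, update = self.pending[0]
            # Safe only if every other node has already sent something later.
            heard_later = all(
                seen > stamp
                for node, seen in self.latest.items()
                if node != origin
            )
            if not heard_later:
                break
            heapq.heappop(self.pending)
            self.applied.append(update)
```

With this rule no roll back is needed, at the price of delaying each update until every other node has been heard from.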

2.5.  Decentralized  Deadlock  Detection 

In studying the problem of deadlock in distributed data bases, Chu and Ohlmacher
[CHU 74] took the approach of deadlock prevention, by requiring all resources to be
allocated to a process before it is scheduled. More recently, attention has been focused on
deadlock detection in distributed environments. Chandra [CHAND 74], and Mahmoud
[MAHM 76] both proposed that each node maintain global information as to process
activity and resource requests and allocations. These methods are unsatisfactory in that they
require large amounts of message traffic and local storage, in addition to being very sensitive
to node and link failures.

Goldman  [GOLD  77]  uses  the  concept  of  an  Ordered  Blocked  Process  List  (OBPL)  to 
keep  track  of  the  dependencies  of  processes  waiting  for  resources  held  by  other  processes. 
Each  process  is  only  allowed  to  request  one  resource  at  a  time.  An  OBPL  is  “expanded”  by 
being  passed  around  the  blocked  processes.  If  a  cycle  is  found,  then  deadlock  has  been 
discovered.  The  constraint  of  requesting  a  single  resource  at  a  time  is  a  severe  one.  There  is 
also  a  good  deal  of  redundancy  in  requiring  OBPL’s  to  be  built,  and  then  expanded. 

The proposal of Isloor and Marsland [ISLO 78] is based on Holt’s [HOLT 71] concept
of a knot in a process-resource graph being equivalent to the presence of deadlock. They
maintain a set of reachable nodes for each node in the distributed system, and then update
these tables as the graph of process-resource interactions changes. Deadlock is found at a
node i if its reachable set contains itself. This result is dependent on the assumption that a
resource is single-unit and can be held by one process at a time. We will presume a more
general context in which incremental acquisition of multi-unit resources is permitted, and
present a decentralized deadlock detection algorithm which follows the edges of the
process-resource graph rather than maintaining the reach sets of nodes.
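
The reachable-set test can be illustrated on a small process-resource graph. The sketch below recomputes the sets from scratch for clarity, whereas the proposal discussed above updates them incrementally as the graph changes; the function names and the dictionary representation of the graph are our own assumptions.

```python
def reachable_sets(edges):
    """edges: dict node -> iterable of successor nodes in the directed graph."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    reach = {node: set(edges.get(node, ())) for node in nodes}
    changed = True
    while changed:
        changed = False
        for node in nodes:
            extra = set()
            for nxt in reach[node]:
                extra |= reach[nxt]
            # Grow the set until nothing new can be added.
            if not extra <= reach[node]:
                reach[node] |= extra
                changed = True
    return reach


def deadlocked_nodes(edges):
    """Nodes whose reachable set contains themselves (single-unit resources)."""
    reach = reachable_sets(edges)
    return {node for node, seen in reach.items() if node in seen}
```

For instance, if P1 waits for R1, R1 is held by P2, P2 waits for R2 and R2 is held by P1, then each of these four nodes reaches itself, while a further process P3 waiting for R1 does not.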

2.6.  LeLann 

LeLann  [LELA  77]  put  forward  a  decentralized  method  of  control  through  the  use  of 
circulating  tokens.  The  environment  is  any  physical  network,  onto  which  a  virtual  ring  is 
mapped.  When  a  process  has  a  token,  it  can  do  some  work;  when  it  does  not,  it  waits.  An 
interesting  problem  which  he  addressed  is  the  situation  in  which  a  control  token  is  lost.  If 
time-out  is  the  mechanism  of  detection,  then  several  processes  could  detect  this  fault; 
however,  only  one  new  control  token  should  be  emitted. 

Assume that the names of the nodes have a total ordering, and are used as their rank.
Then by holding an election, the node with the smallest name finds itself appointed to emit
the new token. Each node has a timer, which is not necessarily the same as that of the
others. A node which times out sends out a candidate token. If it sees a control token before
it gets its candidate token back, it will remove the candidate token; in other words, it
perceives that an election is unnecessary. Each node which has sent out a candidate token is
in the election, and remains in the election as long as it sees neither a control token nor a
candidate token smaller than its own. The node which remains in the election and gets its
candidate token back will emit the new control token. Without loss of generality, the highest
numbered node could also have been elected. In general, the variation among timers
precludes the need for all nodes to participate in any particular election. Note that this is not
only a decentralized appointment of a leader, it is also a solution which ensures that the
elected process always knows it has been elected. If we took some other method by which
any node which wanted to could find the lowest node in the network, the elected node would
still have to be informed. Not only does there have to be an election phase, there must also
be an informing phase. Furthermore, an arbitrary number of nodes may perform this task.
LeLann’s method provides an elegant solution which takes all three aspects into account.
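
A small simulation makes the election concrete. It is a simplified sketch under our own assumptions: the function name lelann_election is hypothetical, node names are distinct and comparable, every candidate token travels once around the whole ring, all initiators have timed out before any token reaches them, and no control token appears during the election.

```python
def lelann_election(ring, initiators):
    """ring: node names in ring order; initiators: non-empty set of nodes
    that timed out. Returns (winner, number of token hops)."""
    n = len(ring)
    in_election = set(initiators)
    messages = 0
    for origin in sorted(initiators):
        start = ring.index(origin)
        # The candidate token of `origin` visits every node and comes back.
        for hop in range(1, n + 1):
            node = ring[(start + hop) % n]
            messages += 1
            # A candidate seeing a token smaller than its own leaves the election.
            if node != origin and node in in_election and origin < node:
                in_election.discard(node)
    (winner,) = in_election
    return winner, messages
```

The only candidate never to see a smaller token is the smallest initiator, so it alone is still in the election when its own token returns, and it therefore knows that it has been elected without a separate informing phase.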

An extension to the use of control tokens is presented in [LELA 78]. Adopting the
concept of sequencers [REED 79], a control token carries with it a number which is
incremented every time it is used. Every non-local activity originating at a node must be
given such a number, which is in essence a ticket. The synchronization of conflicting
activities in a distributed system can then be accomplished by each node knowing, for a
given request, whether there is an earlier one which has not yet been acted upon. In
anticipation of the fact that the control token may take a long time to traverse the virtual
ring of processors, LeLann presents algorithms which give multiple tickets at a node, some
for queued pending events, and others in anticipation of events to come before the next
arrival of the control token.

In  Chapter  4  we  will  give  a  more  detailed  analysis  of  these  algorithms,  and  compare 
them  with  some  techniques  which  we  introduce  in  Chapters  3  and  4. 

2.7.  Spira 

The finding of a minimum spanning tree, given arbitrary positive weights on the edges of
a connected undirected graph, is an important problem for network analysis. Spira [SPIR
77] approached this problem for a network of processes joined by communications links as
edges, by using a decentralized technique. Essentially, fragments are recursively joined on
their minimum out-going edges. Level 0 fragments are nodes. A level 1 fragment has a core
pair of nodes which are each other’s nearest neighbours. A level 1 fragment also includes all
nodes whose nearest neighbours are in the fragment. Each core finds the minimum
out-going edge of a fragment, and adjacent cores are united into higher level fragments
through such edges.

This approach was first described by Rosenstiehl [ROSE 72], and is ingenious but
somewhat complicated. Spira claims that the average communication complexity,
which is the product of mean message length and number of edges traversed, is O(n log^2 n)
bits. There are seven message types used, and it is difficult to convince oneself of the
correctness of the algorithm or of the complexity result. Gallager [GALL 79] has introduced
a variation of Spira’s method, essentially to allow j-level fragments to connect to k-level
fragments when j < k, which produces some increase in efficiency. We will later show that a
decentralized version of a standard sequential algorithm produces a simple and efficient
method of finding the minimum spanning tree.

2.8.  Rosenstiehl

Some of the earliest and most interesting work on decentralized control appeared in a
paper on Intelligent Graphs by Rosenstiehl [ROSE 72]. He considers a finite network of
identical finite-state automata. The edges of the graph modelling the network are the limbs
of the network automata. By considering the state of an automaton as a tuple, one can then
assign discrete states to limbs. Using the synchronous transition of all automata, each
sensing its neighbours’ states and going into a new state at specific intervals of time, he was
able to construct algorithms for a number of problems: recoil traversal of mazes,
construction of an Euler path, block decomposition, and Hamiltonian cycle detection, among
others.

Consider, for example, the algorithm for a recoil automaton which will traverse a
labyrinth. In essence, one starts with a particular limb at a particular node, and tries to build
a sequence of nodes (a complete word) by traversing unused limbs first, and retracing the
word in the opposite direction, if necessary.

Rosenstiehl’s work is based on synchronous finite automata, and thus differs from the
model which we have introduced in several respects: first, there is no local storage, which
makes his algorithms rather complicated, as states have to become storage mechanisms;
secondly, he assumes synchronous simultaneous sensing of neighbouring states, which is
more restrictive than the asynchronous message system which we have defined; thirdly, the
need to express algorithms using state transitions on finite-state automata introduces a
considerable difficulty in the natural expression of these algorithms.

2.9.  Decentralized  Routing 

In a computer network of arbitrary configuration, messages between nodes are routed
through intermediate nodes. Assuming that each node has a table which shows both the
distance to a particular node, and the next node in the route to that node, the problem
Tajibnapis [TAJI 77] addressed was that of updating these tables in the event of a
configuration change. The decentralized solution is simple: if a node detects a crash (or gets
a new neighbour) then it updates its table, and sends a copy to all its neighbours. These
neighbours in turn update their tables, and any neighbour whose table changes as a result
sends its new table to its own neighbours. He shows that this basic protocol is guaranteed to
terminate if the network stabilizes, and will work in any pattern of outages and recovery.
However, he acknowledges that the number of messages is an exponential function of the
number of nodes in the network.
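
The flavour of this table exchange can be shown by a short simulation. It is a deliberately reduced sketch and not Tajibnapis's protocol: the function name converge is hypothetical, distances are hop counts, and only the case in which links appear is covered, so that distances can only improve. Handling a crash requires a node to recompute its table from the tables last received from its remaining neighbours, which is not shown.

```python
from collections import deque


def converge(adjacency):
    """adjacency: dict node -> set of neighbours (assumed symmetric).

    Returns (tables, messages), where tables[u][dest] = (distance, next node).
    """
    unreachable = len(adjacency)  # no real route is this long
    tables = {u: {u: (0, u)} for u in adjacency}
    queue = deque(adjacency)      # nodes whose table has changed
    messages = 0
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            messages += 1         # u sends a copy of its table to v
            changed = False
            for dest, (dist, _) in tables[u].items():
                if dest == v:
                    continue
                best = tables[v].get(dest, (unreachable, None))[0]
                if dist + 1 < best:
                    tables[v][dest] = (dist + 1, u)
                    changed = True
            # A neighbour whose table changed must in turn inform its neighbours.
            if changed and v not in queue:
                queue.append(v)
    return tables, messages
```

The exchange stops by itself once no table changes, which mirrors the termination property claimed for the basic protocol when the network stabilizes.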

Segall and Merlin [SEGA 78] have introduced a decentralized algorithm for maintaining
the best estimated routing paths to a particular node, in a dynamic network in which path
weights may change and nodes may fail or enter the system in an unpredictable pattern.
Consider a destination SINK node, for which each node in the network already has a
preferred edge which will take it towards SINK, such that they form a directed spanning tree.
Then the algorithm will maintain this tree in the face of changes. Furthermore, if the graph
remains static, then in a finite number of repetitions of the algorithm, this spanning tree will
converge to the strict shortest path tree for the SINK node, from all other nodes.

Consider only control messages, each of which bears the accumulated weight of the path
it has taken from SINK. Let SINK start by sending a message to all its neighbours. A node
collects messages until it gets a message from its preferred edge. It then finds the minimum
weight of the paths which have so far arrived, and sends this, as the best estimate, on all its
edges except its preferred edge. When it has received messages from all its edges, it chooses
the minimum weight from these messages, sends that on the previous preferred edge, and
chooses as its new preferred edge that which brought it its minimum weight.

Several points should be noted about this algorithm. First, it can easily be seen that
one execution will not necessarily produce the actual shortest path tree. Secondly,
the number of messages used is basically twice the number of edges in the graph. Thirdly,
messages form an outward wave of activity from SINK, reach the extremities of the spanning
tree, and then produce an inward wave towards SINK, during which best edges based on
estimated minimum weight paths are chosen. This activity is similar to that of echo
algorithms, which we introduce in Chapter 4. Segall’s algorithm presumes a spanning tree,
while echo algorithms dynamically induce a spanning tree which supports the two-phase
activity. The common observation is that in a decentralized algorithm, knowledge of
termination can be difficult to obtain, and these are simple techniques which economically
achieve the desired goal.

2.10.  Dijkstra  and  Scholten 

Dijkstra and Scholten [DIJK 79b] have recently presented a method of signalling the
termination of a diffusion computation for a network of machines, represented as a directed
graph, in which there is a distinguished node, called the environment, with no in-edges. Let
it initiate a finite series of messages to its successors, which may in turn send messages on
their out-edges. The computation terminates when no more messages are sent. The general
signalling scheme assumes that nodes may receive signals which travel in the direction
opposite to that of messages.

In the total computation, each edge will have carried as many messages as signals.
Define the deficit of an edge to be the number of messages sent less the number of signals
received on that edge. Further define C to be the sum of the deficits of the incoming edges
of a node, and D to be the sum of the deficits of its outgoing edges. This leads to the
invariants:

P0: each edge has a non-negative deficit

P1: C is non-negative

P2: D is non-negative

P3: C>0 OR D=0

Then the act of signalling is guarded by the Boolean G: C>1 OR (C=1 AND D=0).
Essentially, these invariants guarantee the truth of the proposition that following the end of
computation, all the edges will have signalled, and that the environment will return to a
neutral state, in which C = 0 AND D = 0.
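
The bookkeeping behind C, D and the guard G can be sketched as follows. The class Node and the functions send and signal are hypothetical names, delivery is instantaneous, and the choice of which incoming edge to signal is our own assumption: a node keeps one unit of deficit on the first edge along which it was engaged and returns that signal last.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.in_deficit = {}    # sender name -> deficit of that incoming edge
        self.out_deficit = {}   # receiver name -> deficit of that outgoing edge
        self.first_edge = None  # sender of the message that engaged this node

    @property
    def C(self):
        return sum(self.in_deficit.values())

    @property
    def D(self):
        return sum(self.out_deficit.values())

    def guard(self):
        # G: C>1 OR (C=1 AND D=0)
        return self.C > 1 or (self.C == 1 and self.D == 0)


def send(src, dst):
    """A message from src to dst raises the deficit of the edge by one."""
    if dst.C == 0:
        dst.first_edge = src.name
    src.out_deficit[dst.name] = src.out_deficit.get(dst.name, 0) + 1
    dst.in_deficit[src.name] = dst.in_deficit.get(src.name, 0) + 1


def signal(node, nodes):
    """Return one signal on an incoming edge; allowed only when G holds."""
    if not node.guard():
        raise ValueError("guard G does not hold")
    spare = [
        sender
        for sender, deficit in node.in_deficit.items()
        if deficit > 0 and (sender != node.first_edge or deficit > 1)
    ]
    target = spare[0] if spare else node.first_edge
    node.in_deficit[target] -= 1
    nodes[target].out_deficit[node.name] -= 1
    return target
```

A node with C=1 and D>0 must wait: signalling would make C zero while it still has outstanding messages, violating P3. The environment has no in-edges, so its C is always zero and it never signals; it is neutral exactly when its D has returned to zero.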

To show that, after the environment has returned to a neutral state, the diffusion
computation has indeed ended, Dijkstra introduced the additional invariant:

P4:  all  engaged  nodes  (with  C>0  OR  D>0)  are  reachable  from  the  environment  via 
directed  paths  whose  edges  all  have  positive  deficits. 

By observing that there is an edge into a node which is the first along which a message
came to it, and that there is only one such first edge for each node, one can show the
existence of a tree rooted at the environment, to which each engaged node belongs. The
edges of this tree form the paths which imply P4, and this in turn verifies the truth of the
proposition that the return to neutral of the environment is the end of the computation.

This paper extends the method of programming calculus presented by guarded
commands [DIJK 75] to a distributed environment. The signalling technique used is similar
to the method we use for detecting the end of a parallel traversal of a graph. We shall discuss
this in more detail in Chapter 4.

2.11.  Conclusions 

We see some common threads running through the literature. The first issue is that of
starting. It would be most general to assume that all processes are equal. In Dijkstra’s
signalling scheme, this is not so: a distinguished node is assumed. However, there is a more
fundamental problem. Do we assume that all processes start in unison to accomplish some
decentralized algorithm, or do we assume that any node may initiate the algorithm? If the
latter, what if all nodes choose to initiate the algorithm by coincidence? This we call the
START-UP problem, which will be addressed in more detail.

The next issue is that of speed. Rosenstiehl’s model assumed a synchronous behaviour
for all finite-state machines. Yet in asynchronous systems, if speed differences are too large,
it becomes difficult to make any meaningful statements about the expected execution time of
the algorithm. In fact, if some part of the system is very slow, it can make the performance
of the entire distributed system unacceptable.

Communication is the common aspect of all these algorithms, and message-passing the
basic tool. We will now go on to an example of decentralized control which will motivate
the need for a decentralized mutual exclusion mechanism, and consider first, for simplicity,
the configuration of a circle of processes.

Chapter 3


The  Circular  Configuration 


3.1.  The  n-philosopher’s  Problem 

The  well-known  five  philosophers  problem,  first  presented  by  Dijkstra  [DIJK  71],  served 
to  introduce  the  concept  of  a  semaphore  as  a  mutual  exclusion  mechanism  for  synchronizing 
cooperating  sequential  processes  in  a  pseudo-concurrent  environment.  His  solution  avoids 
deadlock  and  starvation  by  the  use  of  weak  central  control:  semaphores  and  global  state 
variables.  We  will  take  a  variant  of  this  problem  and  use  it  to  motivate  a  decentralized 
solution,  which  in  turn  introduces  the  need  for  a  decentralized  mutual  exclusion  mechanism. 

For generality, let us consider the case of n philosophers. Briefly, they sit facing each
other around a table, with a fork between each pair. Philosophers alternate between
thinking and eating, but each philosopher needs both his left fork and his right fork in order
to eat. Obviously they cannot all be eating simultaneously. If every philosopher were to pick
up a left fork simultaneously (or equivalently, a right fork), then they would be stuck in
deadlock. We need a program for each philosopher which will deal with this problem, and
still avoid the possibility of starvation, a sequence of events in which a hungry philosopher
never gets to eat. It will be necessary to introduce three states for each philosopher:
thinking, hungry and eating.

3.1.1.  Dijkstra’s  Solution 

The method outlined in [DIJK 71] used global state variables for each philosopher in
order to ensure that no hungry philosopher is ever allowed to make the transition to eating
if either of his neighbours is eating. By protecting the test and transition with a global
semaphore, he ensures that only one philosopher at a time can go from hungry to eating.

3.1.2.  Courtois’  Solution 

Courtois [COUR 77] introduced a technique for distributed control which used
allowance counters and designated certain philosophers as controllers to avoid deadlock and
starvation. The problem with his method is that an Euler cycle of controllers must be
avoided, a priori. This requires a predetermined configuration of processors, which is a
form of strong central control.

3.1.3.  The  Decentralized  Model 

Let us consider the problem in a decentralized context. Dijkstra’s method allowed a
philosopher to either get no forks, or both of them. However, this required global variables
protected by a system-wide semaphore. We will instead consider each philosopher to be an
independent agent who, on becoming hungry, reaches for his left and right forks, in no
particular order. We further assume that each fork has an arbiter mechanism, which resolves
simultaneous requests by giving the fork to one neighbour, and putting the other in a wait
state. When the fork is freed, the waiting philosopher will be guaranteed to get the fork.

Define a critical state of a philosopher as that hungry state in which a philosopher is
holding one fork and waiting for another. It follows immediately that the philosophers are
deadlocked if and only if they are all in a critical state. We will allow deadlock to occur,
and we now construct a decentralized method for detecting and resolving it. This approach
permits maximum spontaneous activity, requiring coordination only when everyone is stuck.
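
The definition of a critical state, and the equivalence with deadlock, can be checked on a snapshot of the table. In the sketch below the representation is our own assumption: fork f lies between philosophers f and f+1 (modulo n), holder[f] names the philosopher holding fork f or is None when the fork is free, and the function names are hypothetical.

```python
def is_critical(i, holder):
    """Philosopher i holds exactly one fork and the other is held elsewhere."""
    n = len(holder)
    left, right = holder[(i - 1) % n], holder[i]
    mine = (left == i) + (right == i)
    return mine == 1 and left is not None and right is not None


def deadlocked(holder):
    """The philosophers are deadlocked iff every one of them is critical."""
    return all(is_critical(i, holder) for i in range(len(holder)))
```

With three philosophers who have each picked up the fork on the same side, holder is [0, 1, 2] and every philosopher is critical. If philosopher 0 holds both of his forks, holder is [0, 1, 0], philosopher 0 is eating and there is no deadlock.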

Implicit in decentralized algorithms is the need to pass messages. Thus, we must assume
a message-passing facility among philosophers which functions independently of the states of
the philosophers, and has access to the state of each philosopher. We model the philosophers
by a circle of processors, connected by communication links which carry messages in a
consistent direction. Each philosopher is implemented as a process at a node, but each node
also supports a control process which handles messages and has access to the state of its
philosopher. Forks are modelled as resources which are shared by two adjacent nodes.

3.1.3.1.  The  Detection  Message

Observe that a message which visits every node has access to the state of every
philosopher, and furthermore that the system is not deadlocked if there is some philosopher
who is not in a critical state. Thus, let us assume the existence of a circulating Detection
Message which visits each philosopher in turn. If any philosopher is not in a critical state,
the Detection Message is stamped with “there exists a philosopher who is not critical”.
Clearly, a philosopher sending out a Detection Message in a deadlock situation will get it
back unstamped. This is how deadlock can be detected, and leads to the first Proposition:

Proposition 3.1. If deadlock is present, an originator of a Detection Message gets it back
unchanged. □

Note that the converse is not true. A Detection Message may lag behind the cycle of
consumption such that it encounters only philosophers in critical states. However, it follows
that if the originator gets its Detection Message back unchanged in this case, the wave of
consumption must have preceded the return of the message. In other words, the originating
philosopher will have consumed by the time it thinks that deadlock is present. A
Consumption Sequence Number, incremented each time a philosopher eats and affixed to the
Detection Message, will avoid any unnecessary action. Thus, the Detection Message has the
following property:

Proposition  3.2.  If  a  philosopher  gets  its  Detection  Message  back  unchanged,  it  has  either 
detected  deadlock  or  it  has  consumed  following  the  sending  of  its  message.  □ 
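
The two propositions can be illustrated with a small sketch. It is a minimal model under stated assumptions, not the implementation used in this thesis: the names Philosopher, circulate and deadlock_detected are hypothetical, the ring is frozen while the message travels, and the originator's own state is not examined because it is known to be critical when it sends.

```python
from dataclasses import dataclass


@dataclass
class Philosopher:
    critical: bool   # True if holding one fork and waiting for the other
    seq: int = 0     # Consumption Sequence Number (incremented on each meal)


def circulate(ring, origin):
    """Send one Detection Message around the ring from index `origin`.

    Returns (stamped, seq_at_send). `stamped` is True if some visited
    philosopher was not in a critical state.
    """
    seq_at_send = ring[origin].seq
    stamped = False
    n = len(ring)
    for step in range(1, n):              # visit every other philosopher once
        if not ring[(origin + step) % n].critical:
            stamped = True
    return stamped, seq_at_send


def deadlock_detected(ring, origin, stamped, seq_at_send):
    """Proposition 3.2: an unstamped message means deadlock only if the
    originator has not consumed since sending it."""
    return (not stamped) and ring[origin].seq == seq_at_send
```

An unstamped message whose sequence number no longer matches corresponds to the case in which the wave of consumption overtook the message, so no action is taken.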

3.1.3.2. The Relinquish System

The  next  questions  are:  who  issues  the  Detection  Message,  and  what  happens  when 
deadlock  is  detected?  The  answers  to  both  lie  in  the  same  mechanism,  the  Relinquish  Sys¬ 
tem.  In  deadlock,  every  fork  is  held  and  being  waited  for,  and  furthermore,  every  philoso¬ 
pher  holds  a  fork  and  is  waiting  for  another.  Thus,  if  some  philosophers  give  up  the  forks 
they  hold,  then  by  virtue  of  the  arbiter  queue  mechanism,  one  of  their  neighbours  will  be 
able  to  consume.  We  do  not  wish  everyone  to  relinquish,  for  then,  no  one  consumes,  and  a 
sequence  of  deadlock  and  relinquish  can  lead  to  perpetual  starvation. 

We  introduce  a  Relinquish  Token,  and  call  a  philosopher  holding  a  Relinquish  Token  an 
R-philosopher.  Furthermore,  let  the  philosopher  detecting  deadlock  issue  a  Relinquish  Mes¬ 
sage,  which,  upon  coming  to  an  R-philosopher,  causes  it  to  relinquish  its  fork.  If  the  system 
has  more  than  one  R-philosopher,  then  there  is  a  first  R-philosopher  who  relinquishes  its 
fork. This may propagate a wave of consumption such that, given the asynchrony of the system,
a subsequent R-philosopher may have consumed by the time the Relinquish Message
gets  to  it.  Thus,  we  observe  that  the  Relinquish  Message  should  cause  R-philosophers  in  cri¬ 
tical  states  to  give  up  their  forks. 

Clearly,  it  is  only  a  philosopher  who  enters  a  critical  state  who  needs  to  be  concerned 
with  the  detection  of  deadlock.  However,  we  note  that  all  such  philosophers  must  issue  a 
Detection  Message.  If  only  a  subset  of  philosophers  entering  critical  states  send  out  Detec¬ 
tion  Messages,  and  there  is  a  philosopher  not  in  a  critical  state,  they  will  not  detect 
deadlock. Then if this last philosopher enters its critical state but is not a member of the
subset  which  sends  Detection  Messages,  deadlock  will  never  be  found.  Hence,  all  philoso¬ 
phers  entering  critical  states  must  issue  a  Detection  Message.  Following  the  detection  of 
deadlock,  the  maximum  number  that  can  be  consuming  is  only  (n  div  2).  Thus,  if  we  place 
Relinquish  Tokens  at  alternate  philosophers,  we  get  the  possibility  of  maximum  activity  fol¬ 
lowing  deadlock  (if  the  Relinquish  Message  is  sufficiently  fast). 

3.1.3.3. Starvation

So  far,  we  have  not  been  very  precise  about  the  meaning  of  relinquish.  Consider  an  R- 
philosopher  in  a  critical  state  owning  a  held  fork  and  waiting  for  a  requested  fork,  faced 
with  a  Relinquish  Message.  There  are  three  possible  definitions  of  relinquish : 

a)  Give  up  both  the  held  fork  and  the  request  for  the  other  fork.  In  other  words,  go  back 
to  thinking. 

b)  Give  up  the  held  fork,  but  maintain  the  request  for  the  other  fork. 

c)  Give  up  the  held  fork,  maintain  the  request  for  the  other  fork,  and  in  addition  request 
the  previously  held  fork  also. 

If  (c)  is  used,  then  the  R-philosopher  will  be  guaranteed  to  eat,  following  deadlock,  for 
his  neighbours  eventually  give  up  the  forks  they  hold.  If  (a)  is  used,  then  all  philosophers  can 
proceed  to  a  thinking  state,  and  then  to  a  deadlock  state,  without  an  R-philosopher  having 
consumed. The R-philosophers can starve, for their lives consist of getting a fork, then forfeiting
it. Similarly, if (b) is used, all R-philosophers can obtain the forks they wait for, say
their  right  forks,  but  the  others,  having  consumed,  can  obtain  their  right  forks  as  well.  Then 
the  philosophers  are  deadlocked  again,  without  the  R-philosophers  having  yet  consumed. 
Following  (b)  again  can  lead  to  an  infinite  sequence  of  relinquishing  for  R-philosophers, 
which  is  starvation. 

In  order  to  motivate  the  need  for  mutual  exclusion,  and  to  study  the  more  difficult  prob¬ 
lems  of  decentralized  control,  let  us  adopt  definition  (a).  To  avoid  starvation,  we  must 
prevent  the  same  philosopher  from  holding  an  R-token  in  successive  deadlocks.  Two 
mechanisms  are  needed:  the  first  is  that  a  Relinquish  Message  must  cause  an  R-token  to  be 
passed  to  a  neighbour.  The  second  is  that  we  must  ensure  that  only  one  Relinquish  Message 
per  deadlock  is  issued.  It  is  clear  that  if  an  arbitrary  number  are  issued,  the  same  philoso¬ 
pher  can  continually  remain  an  R-philosopher. 

We  can  now  complete  the  definition  of  relinquishing.  An  R-philosopher  will,  upon 
receiving  a  Relinquish  Message,  pass  his  Relinquish  token  to  his  neighbour  (in  a  consistent 
direction).  In  addition,  if  he  is  in  a  critical  state,  he  will  return  to  a  thinking  state,  giving  up 
both  his  held  fork  and  his  outstanding  request. 
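
The completed definition of relinquishing can be sketched as a handler run by a node's control process. This is only an illustration: the State, Node and on_relinquish_message names are hypothetical, and the neighbour argument stands for the next node in whichever consistent direction tokens are passed.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    THINKING = "thinking"
    CRITICAL = "critical"   # holds one fork and waits for the other
    EATING = "eating"


@dataclass
class Node:
    state: State
    has_r_token: bool = False
    held_fork: bool = False
    pending_request: bool = False


def on_relinquish_message(node, neighbour):
    """Apply definition (a) of relinquish at one node.

    Returns True if the node actually gave up a fork and a request.
    """
    if not node.has_r_token:
        return False                    # only R-philosophers react
    # The R-token always moves on, so the same philosopher does not
    # hold it in successive deadlocks.
    node.has_r_token = False
    neighbour.has_r_token = True
    if node.state is State.CRITICAL:
        node.state = State.THINKING
        node.held_fork = False
        node.pending_request = False
        return True
    return False
```

An R-philosopher that has already consumed by the time the message arrives passes its token on but keeps its state, which matches the observation that only R-philosophers in critical states give up their forks.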

Guaranteeing that only one Relinquish Message per deadlock is issued, when there are
several philosophers who have detected deadlock, will require an agreement among them as
to  who  will  issue  it.  This  implies  a  mutual  exclusion  mechanism,  one  which  is  decentralized 
and  works  by  agreement  among  a  number  of  contending  processes.  We  will  present  such  a 
mechanism  in  the  sections  that  follow. 

3.1.4.  Summary 

1.  Any  philosopher  entering  a  critical  state  issues  a  Detection  Message  carrying  his  Con¬ 
sumption  Sequence  Number. 

2.  Any  philosopher  in  a  critical  state  getting  a  Detection  Message  passes  it  on;  in  a  non- 
critical  state,  stamps  it  OK. 

3.  A  philosopher  getting  back  its  Detection  Message  unchanged  detects  deadlock  if  its 
Consumption  Sequence  Number  is  still  the  same  as  the  one  on  its  Detection  Message. 

4.  The  mutual  exclusion  mechanism  allows  only  one  philosopher  to  emit  a  Relinquish 
Message. 

5.  An  R-philosopher  receiving  a  Relinquish  Message  goes  back  to  the  thinking  state,  giv¬ 
ing  up  his  fork  and  his  request.  He  then  passes  on  his  R-token,  in  a  consistent  direc¬ 
tion. 

6.  A  philosopher  who  consumes  increments  his  Consumption  Sequence  Number  by  one. 

The n-philosophers problem has been used for two reasons. First, we have taken a familiar
problem and put it into the context of a distributed system, creating a decentralized
solution  which  uses  no  global  resources.  Secondly,  we  have  used  it  to  motivate  the  need  for 
a  mutual  exclusion  mechanism,  which  must  be  decentralized.  It  may  be  argued  that  the  use 
of  R-tokens  introduces  an  inequality  among  processes.  However,  unlike  Courtois’  control¬ 
lers,  there  is  not  a  fixed  hierarchy  of  processes.  Every  philosopher  is  essentially  equivalent, 
with  an  R-token  inducing  a  temporary  difference  in  state. 

3.1.5.  Reliability  Considerations 

Let  us  consider  this  algorithm  in  the  face  of  failures.  A  line  failure  would  disrupt  the  cir¬ 
cular  configuration  for  which  this  algorithm  is  designed.  Although  we  introduce  techniques 
later  for  dealing  with  general  graph  configurations,  this  algorithm  is  confined  to  circles  of 
processes.  Thus,  until  communication  is  restored,  the  algorithm  must  remain  suspended.  If  a 
philosopher  fails  but  the  communications  between  its  neighbours  remain  intact  through  it, 
there  are  several  cases  which  arise. 

First  assume  no  message  loss  was  associated  with  the  failure.  If  the  philosopher  was  in 
the  thinking  state,  he  will  not  be  competing  for  his  adjacent  forks,  and  the  system  will  no 
longer  deadlock.  If  he  was  holding  one  or  more  forks  and  does  not  give  them  up  in  failing, 
then  one  or  both  neighbours  never  get  these  fork(s),  and  starvation  will  result.  On  the  other 
hand,  if  failing  includes  giving  up  his  forks,  then  the  system  will  come  to  a  no-deadlock 
state. 

The  loss  of  a  message  with  a  philosopher  implies  that  the  originator  of  the  message  must 
time-out,  for  it  waits  in  vain  for  its  message  to  return.  Another  message  can  then  be  sent 
out.  It  is  clear  that  the  loss  of  a  philosopher  changes  completely  the  nature  of  the  problem, 
for  deadlock  will  not  occur  again,  while  starvation  becomes  a  distinct  possibility.  Since  we 
have precluded message losses alone, the algorithm is in a sense robust as long as the system
is intact. We note, however, that the unique character of the n-philosophers problem is very
sensitive  to  node  and  line  failures. 

This  particular  problem  has  been  chosen  as  a  vehicle  for  the  understanding  of  some 
aspects  of  decentralized  control.  As  such,  reliability  is  not  one  of  the  critical  components  of 
the  algorithm.  In  Chapter  6,  a  general  approach  to  the  problem  of  process-resource  systems 
is  given.  A  more  complete  analysis  of  reliability  issues  for  the  decentralized  control  of  such 
systems  will  be  found  there. 

3.2.  Decentralized  Mutual  Exclusion 

The  classical  studies  of  cooperating  sequential  processes  [DIJK  71,  DIJK  68,  BRIN  72, 
HOAR  74]  have  introduced  the  notion  of  mutual  exclusion  as  a  property  which  is  necessary 
to  the  orderly  and  harmonious  interaction  of  processes  which  use  common  resources.  Thus, 
if  two  processes  A  and  B  wish  to  use  the  same  resource  X,  a  mutual  exclusion  mechanism, 
mutex,  will  allow  either  A  or  B  (but  not  both),  access  to  X  at  any  particular  time.  The 
natural  extension  of  this  idea  is  to  a  set  of  processes  contending  for  a  common  resource.  The 
mutual  exclusion  mechanism  would  allow  them  to  use  it  one  at  a  time.  This  idea  and  its 
variations  have  been  studied  by  many  authors  in  the  context  of  pseudo-concurrent  computing 
environments, and constructs such as semaphores [DIJK 68], critical regions [BRIN 72], and
monitors [HOAR 72] have been used to provide mutual exclusion in concurrent systems. As
LeLann  has  pointed  out  [LELA  78],  mutual  exclusion  of  conflicting  events  is  also  a  mechan¬ 
ism  for  avoiding  deadlock  in  systems  which  use  incremental  locking  and  waiting  in  the  exe¬ 
cution  of  processes.  We  will,  in  this  thesis,  tend  to  emphasize  mutual  exclusion  more  as  a 
method  of  sequencing  those  events  which  would  otherwise  produce  erroneous  results,  than  as 
a  deadlock  avoidance  method.  We  will  later  show  that  there  are  methods  other  than  locking 
for  correctly  sequencing  a  number  of  conflicting  events.  In  general  process  resource  systems 
in  which  deadlock  is  possible,  avoidance  techniques  may  be  satisfactory  if  pre-claiming  of 
resources  is  used.  However, if  resources  are  allocated  dynamically,  then  we  prefer  deadlock 
detection  and  resolution  methods.  Thus,  Chapter  6  presents  algorithms  for  deadlock  detec¬ 
tion  in  process  resource  interactions  modelled  by  directed  graphs. 

The  need  for  mutual  exclusion  arises  when  several  processes  contend  for  a  common 
resource.  It  is  important  to  point  out  that  processes  operating  on  disjoint  resources  should 
proceed  in  parallel,  especially  in  distributed  systems.  Thus,  strictly  local  events  at  various 
nodes  can  execute  simultaneously.  Events  which  share  the  same  resource  can  also  be  disjoint 
in  time.  Therefore,  only  those  concurrent  activities  on  common  objects  must  be  sequenced. 
A  critical  question  in  considering  mutual  exclusion  mechanisms  is  the  granularity  of  the 
shared  resource.  In  the  context  of  file  updates,  for  example,  the  resource  could  either  be  the 
whole  file,  or  a  set  of  records  satisfying  some  logical  predicate,  or  a  particular  data  item  in  a 
particular  record.  We  want,  as  much  as  possible,  to  require  mutual  exclusion  on  small 
resources,  rather  than  on  entire  files.  Work  done  on  the  SDD-1  system  by  Bernstein  et  al 
[BERN  77]  has  suggested  a  method  of  predetermining  when  updates  may  conflict,  and  when 
they  have  no  intersection.  The  decentralized  mutual  exclusion  mechanisms  we  present  are 
assumed  able  to  check  if  concurrent  events  conflict  in  their  semantics.  If  they  do,  then  they 
are  scheduled  in  some  sequence,  and  if  not,  they  operate  transparently  with  respect  to  each 
other.  Another  aspect  of  the  granularity  of  mutual  exclusion  mechanisms  is  the  extent  of 
activity  which  must  occur  exclusively,  within  a  critical  region.  Clearly,  those  mechanisms 
which require only a small number of operations within a critical region allow more
concurrent activity than those with large critical regions. The exact characterization
of, and further research into, granularity mechanisms is beyond the scope of this thesis.

We  will  introduce  several  election  mechanisms  which  can  be  used  to  implement,  in 
simple  ways,  decentralized  mutual  exclusion  in  a  circular  configuration.  For  large  systems, 
it  is  not  clear  that  election  is  the  best  way  to  provide  mutual  exclusion,  because  of  the  large 
number  of  messages  involved.  Certainly,  a  pre-determined  hierarchy  of  processors  will  serve 
the  same  end,  but  this  violates  our  requirement  for  decentralized  control.  The  use  of  circu¬ 
lating  control  tokens  and  tickets  [LELA  78]  is  another  way  of  ensuring  that  processes  which 
operate  on  common  objects  proceed  in  the  same  sequence  at  each  node.  In  a  sense,  these  are 
global  objects,  though  distributed  in  that  they  circulate.  In  spite  of  being  global  objects, 
however,  they  have  the  nice  property  of  being  robust. 

Given  the  necessity  in  distributed  systems  for  mutual  exclusion  mechanisms,  we  must 
depend on a message passing technique since we do not assume any global resources. In particular,
we do not even allow ourselves the robust distributed global devices of control tokens
and tickets which LeLann introduced [LELA 78]. Rather, we start from an approach
based  on  earlier  work,  also  by  LeLann  [LELA  77].  Let  us  briefly  review  the  major  points  of 
his  election  algorithm.  He  considers  a  model  of  processes  in  a  circular  network  which 
operate  by  using  a  single  circulating  control  token.  If  this  token  should  be  lost,  then  an 
election  must  take  place  to  appoint  a  single  process  to  emit  another  control  token  for  the 
network.  It  is  important  that  only  one  process  be  elected  among  the  many  which  sense  the 
absence  of  the  control  token.  Each  node  which  times  out  sends  out  a  candidate  token  bear¬ 
ing  the  unique  name  of  the  node.  A  node  remains  in  the  election  as  long  as  it  does  not  see  a 
control  token  or  a  smaller  candidate  token.  The  node  which  gets  its  candidate  token  back 
and  is  still  in  the  election  emits  the  new  control  token.  Clearly,  it  is  the  smallest  of  the  con¬ 
tenders.  Without  loss  of  generality,  the  highest  contender  could  have  been  chosen  to  equal 
effect.  In  this  thesis,  we  use  the  convention  of  finding  the  highest  node,  in  the  election  algor¬ 
ithms  which  follow.  Typically,  LeLann’s  algorithm  uses  k  nodes  out  of  n  to  hold  the  elec¬ 
tion.  Since  each  candidate  token  circulates  at  least  once,  the  number  of  message  passes 
needed  is  nk,  where  a  message  pass  is  the  movement  of  a  message  from  one  node  to  its 
neighbour,  and  n  is  the  number  of  nodes  in  the  circle.  The  algorithm  requires  execution 
time of at most n, from the time the node which is going to be elected sends out its candidate
token. We note that if all nodes participate in the election, the number of message
passes is n^2.
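
The nk count can be checked with a short sketch of the election as reviewed above. It is an illustration only: the function name lelann_election is hypothetical, and tokens are traced one after another rather than concurrently, which does not change the outcome here because every candidate token completes a full circuit.

```python
def lelann_election(names, candidates):
    """Trace the election reviewed above on a ring.

    names:      unique node names in the direction of message flow
    candidates: names of the nodes that time out and emit a candidate token
    Returns (elected, message_passes).
    """
    n = len(names)
    in_election = set(candidates)
    passes = 0
    for c in candidates:
        start = names.index(c)
        for step in range(1, n + 1):          # each token makes one full circuit
            receiver = names[(start + step) % n]
            passes += 1
            # A contender that sees a smaller candidate token drops out.
            if receiver in in_election and c < receiver:
                in_election.discard(receiver)
    (elected,) = in_election                  # exactly one contender remains
    return elected, passes
```

With three contenders on a ring of five nodes the sketch counts 15 passes, and with all five contending it counts 25.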

In  the  rest  of  this  chapter,  we  deal  only  with  circular  configurations,  in  order  to  focus  on 
the  nature  of  the  mutual  exclusion  algorithms.  We  will  first  describe  an  improvement  to 
LeLann’s  algorithm,  and  then  give  some  variations.  In  particular,  we  will  apply  it  to  the 
multiple-copy  file  update  problem.  In  Chapter  4,  we  will  use  a  parallel  graph  traversal  tech¬ 
nique  to  find  an  algorithm  for  decentralized  mutual  exclusion  in  arbitrary  configurations  of 
processes. 

3.2.1.  The  Extinction  Algorithm 

Assume  a  set  of  n  autonomous  processors  which  are  interconnected  in  a  circle.  Let  them 
be  given  names  uniquely  from  1  to  n,  so  that  the  rank  of  a  node  is  taken  as  its  name,  and  let 
us assume a message passing facility which allows messages to move in a consistent direction,
say clockwise. Each node therefore always receives messages from one neighbour and
passes messages to another. Let there be no global clock, shared global memory, or designated
controller process. Furthermore, assume that no single node knows how many nodes
there are in the system; each is aware only of the identities of its neighbours. In a previous
paper [CHANG 79], we have presented an improvement to LeLann’s algorithm, which reduces
the number of message passes to O(n log n), on the average. It is based on the principle
of  extinction,  whereby  a  message  from  a  node  which  has  no  chance  of  being  the  largest  is 
suppressed.  Call  each  message  a  ballot,  and  call  the  algorithm  to  find  the  largest  an  election. 
Consider  this  Basic  Extinction  Election  briefly. 

Algorithm  3.1.  Basic  Extinction  Election 

Let the message sent out by node i be called ballot-i, and let all ballots be passed
clockwise. Consider a node i receiving ballot-j:

1. if (i = j) then i is the largest node

2. if (i < j) then send ballot-j on.

Implicitly, if (i > j), then nothing at all is done, since node j cannot be the largest.
Hence ballot-j is extinguished. It is easy to see that all ballots meet a higher node before
getting back to their originators, except for the ballot from the largest node.
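
A minimal sketch of Algorithm 3.1 follows, assuming that every node emits its ballot. The function name extinction_election is hypothetical. Each ballot is traced separately, which is valid for the basic algorithm because a node's decision depends only on its own name and the name on the arriving ballot.

```python
def extinction_election(names):
    """Trace Algorithm 3.1 on a ring.

    names: unique node names listed in clockwise order.
    Returns (elected, passes) where passes[j] is the number of times
    ballot-j was passed before being extinguished or returning home.
    """
    n = len(names)
    elected = None
    passes = {}
    for start, j in enumerate(names):
        pos, hops = start, 0
        while True:
            pos = (pos + 1) % n       # ballot-j moves to the clockwise neighbour
            hops += 1
            i = names[pos]
            if i == j:                # rule 1: the ballot came home
                elected = i
                break
            if i > j:                 # implicit rule: ballot-j is extinguished
                break
            # rule 2 (i < j): node i sends ballot-j on
        passes[j] = hops
    return elected, passes
```

For the ring 3, 1, 4, 2, 5 only ballot-5 completes the circle; every other ballot stops at the first larger node it meets.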

3.2.2.  Start-up  Variant 

Recall that we can characterize a decentralized algorithm by P#, its Participation Number,
the number of processes allowed to participate in the algorithm, and S#, its Source
Number,  the  number  of  processes  allowed  to  start  a  single  execution  of  the  algorithm.  In 
this  algorithm,  P#  is  n,  the  number  of  nodes  in  the  system,  and  S#  is  also  n.  However,  S# 
only  indicates  the  number  that  can  join  in  simultaneously.  We  cannot  assume  that  by  magic 
all  nodes  will,  at  the  same  time,  decide  to  send  a  ballot  around.  Instead,  one  or  more  may 
start,  but  at  least  one  will  start.  To  handle  the  problem  of  staggered  starting,  we  let  a  ballot 
arriving  at  a  node  cause  that  node  to  consider  joining  the  election,  if  it  has  not  already  sent 
out  its  ballot. 

Algorithm  3.2.  Start-up  Variant 

Let  each  node  have  a  status,  either  awake  or  asleep.  Each  node  is  initially  asleep.  A 
node  which  starts  spontaneously  turns  itself  awake,  and  sends  out  its  ballot.  Consider  now  a 
ballot-j arriving at node i:

1. If node i is asleep, turn itself awake. Now, if (i > j), then send out ballot-i. If (i < j),
ballot-j is forwarded. Note that i ≠ j, since i was asleep: ballot-j could not have originated
from i.

2. If node i is awake, then if (i = j), then node i is the largest. If (i < j), send ballot-j on.
Otherwise do nothing.

At  the  beginning  of  every  election,  all  nodes  are  assumed  asleep.  Thus,  the  elected  pro¬ 
cess  has  the  responsibility  of  turning  the  processes  asleep  after  the  election  is  over,  so  that  a 
new  election  can  take  place.  This  can  be  done  with  a  clear  message. 
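
The Start-up Variant can be sketched with a simple message queue. This is a sketch under assumptions that are not part of the algorithm itself: the name startup_election is hypothetical, every hop takes one time unit, and ballots are handled in first-in first-out order to approximate that timing.

```python
from collections import deque


def startup_election(names, starters):
    """Trace Algorithm 3.2 on a ring.

    names:    unique node names in clockwise order
    starters: indices of the nodes that start spontaneously
    Returns (elected, message_passes).
    """
    n = len(names)
    awake = [False] * n
    queue = deque()                        # entries are (destination index, ballot)
    for s in starters:
        awake[s] = True
        queue.append(((s + 1) % n, names[s]))
    elected, passes = None, 0
    while queue:
        pos, j = queue.popleft()
        passes += 1
        i = names[pos]
        nxt = (pos + 1) % n
        if not awake[pos]:                 # rule 1: an asleep node wakes up
            awake[pos] = True
            # It replaces a smaller ballot with its own, else forwards.
            queue.append((nxt, i if i > j else j))
        elif i == j:                       # rule 2: own ballot returned
            elected = i
        elif i < j:
            queue.append((nxt, j))
        # otherwise (i > j): ballot-j is extinguished
    return elected, passes
```

When only the largest node starts, its ballot wakes every other node and returns after n passes, since no other ballot is ever emitted.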

3.2.3.  Performance  Analysis 

We  are  interested  in  three  measures:  storage  needed  at  a  node,  communication  time  and 
number of message passes.

3.2.3.1. Space Behaviour

We  assume  that  a  node  can  get  one  message  at  a  time.  Thus,  since  each  message  only 
need  carry  the  identity  of  its  originator  and  a  type,  we  need,  for  n  nodes  and  two  message 
types, (log n + 1) bits.

3.2.3.2.  Time  Behaviour 

We  are  interested  in  communication  time,  and  we  assume  that  the  time  from  one  node 
to  another  is  approximately  unity,  and  process  time  at  a  node  is  negligible.  The  algorithm 
succeeds  when  the  highest  is  found.  If  all  processes  start  simultaneously,  then  the  time  is 
simply  n.  If  the  highest  node  were  the  first  to  start,  its  transit  around  the  circle  would  also 
take n time. However, if the node furthest from n started, its message would take n - 1
units to get to node n, whose message would take another n units to go around. In this case,
the elapsed time is 2n - 1. In all cases, the time taken is < 2n.

3.2.3.3.  Message  Passes. 

Let  us  consider  the  Basic  Extinction  Election.  In  the  best  case,  the  nodes  are  arranged  in 
increasing sequence in the direction of message flow, so that each ballot except ballot-n only
goes to one node, i.e., is passed once. There are n - 1 of these ballots, and ballot-n is passed n
times. Thus, the total number of message passes is 2n - 1.

In  the  worst  case,  the  nodes  are  arranged  in  decreasing  order  in  the  direction  of  message 
flow, so that ballot-i must be passed i times. The total number of message passes is

\[ \frac{n(n+1)}{2} = O(n^2) \]

Assume without loss of generality that message flow is clockwise, and that all possible
circular configurations are equally likely. In the average case, let P(i,k) be the probability
that ballot-i is passed k times, which is the probability that the k - 1 clockwise neighbours of
i are less than i AND the kth clockwise neighbour of i is larger than i. There are i - 1 nodes
less than i. The probability that the k - 1 clockwise neighbours of i are less than i is the
number of ways of choosing k - 1 things from the i - 1 things smaller than i, over the total number
of ways of choosing k - 1 things out of n - 1 things. Writing C(a,b) as the number of
ways of choosing b things from a things, this is:

\[ \frac{C(i-1,\,k-1)}{C(n-1,\,k-1)} \]

The probability that the kth clockwise neighbour of i is larger than i, given that there are
n - i things larger than i, is simply the number of ways of choosing one such larger thing
over the number of ways of choosing one of the n - k things left after k things have already
been placed. This is

\[ \frac{n-i}{n-k} \]

Thus

\[ P(i,k) = \frac{C(i-1,\,k-1)}{C(n-1,\,k-1)} \times \frac{n-i}{n-k} \]
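
As a sanity check on P(i,k), the following sketch evaluates it with exact rational arithmetic. The function names p_passes and expected_passes are hypothetical. For every i less than n the probabilities over k = 1, ..., n - 1 should sum to one, since ballot-i must meet a larger node somewhere before returning.

```python
from fractions import Fraction
from math import comb


def p_passes(n, i, k):
    """P(i,k): probability that ballot-i is passed exactly k times (i < n)."""
    smaller_first = Fraction(comb(i - 1, k - 1), comb(n - 1, k - 1))
    larger_next = Fraction(n - i, n - k)
    return smaller_first * larger_next


def expected_passes(n, i):
    """Expected number of passes of ballot-i, for i < n."""
    return sum(k * p_passes(n, i, k) for k in range(1, n))
```

For n = 4 this gives expected pass counts of 1, 4/3 and 2 for ballots 1, 2 and 3. Adding the 4 passes of ballot-4 yields 25/3, which is 4 times the fourth harmonic number.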


Knowing that ballot-n always takes n passes and that there is only one such ballot, we
can consider the remaining n - 1 ballots, each making at most n - 1 passes. The expected
number of message passes for ballots other than ballot-n is

\[ E_i(k) = \sum_{k=1}^{n-1} k \, P(i,k), \qquad i \neq n \]


Therefore, the expected number of message passes, for all ballots is

\[ E(k) = n + \sum_{i=1}^{n-1} \sum_{k=1}^{n-1} k \, P(i,k) \]


This can be simplified to [FELL 62]:

\[ E(k) = n + \sum_{k=1}^{n-1} \frac{n}{k+1} = n \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} \right) = n \, H(n) \]

where H(n) is the nth harmonic number. This is O(n log n). The details of this derivation
can be found in Appendix I. A simpler, but informal, argument is to say that for any node i
there are n - i + 1 nodes greater than or equal to i (counting i itself), which are on the average
distributed such that the circle contains n - i + 1 equal segments, each of length n/(n - i + 1).
Thus, ballot-i will go on the average n/(n - i + 1) edges before it meets one of these nodes.
For all i, the total number of edges traversed is:

\[ \sum_{i=1}^{n} \frac{n}{n-i+1} = n \, H(n) \]
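
The average can also be confirmed by brute force for small n. The sketch below is an illustrative check, not part of the derivation: the names passes, average_passes and harmonic are hypothetical. It enumerates every circular arrangement, with node n fixed at position 0 since rotations of a circle are equivalent, and counts message passes in the Basic Extinction Election.

```python
from fractions import Fraction
from itertools import permutations


def passes(ring):
    """Total message passes when every node on `ring` emits a ballot."""
    n = len(ring)
    total = 0
    for start, j in enumerate(ring):
        pos = (start + 1) % n
        total += 1                    # first hop of ballot-j
        while ring[pos] < j:          # a smaller node sends ballot-j on
            pos = (pos + 1) % n
            total += 1
    return total


def average_passes(n):
    """Exact mean of passes() over all circular arrangements of 1..n."""
    rings = [(n,) + p for p in permutations(range(1, n))]
    return Fraction(sum(passes(r) for r in rings), len(rings))


def harmonic(n):
    """The nth harmonic number as an exact fraction."""
    return sum(Fraction(1, m) for m in range(1, n + 1))
```

The same passes function reproduces the best and worst cases: 2n - 1 for increasing order and n(n + 1)/2 for decreasing order.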


For the start-up variant, in the best case, node n starts and its ballot suppresses all other
ballots before they get emitted. Thus only n ballot passes are needed. In the worst case,
every ballot starts, and the worst case is the same as the worst case for the Basic Election:
n(n + 1)/2. We cannot analyse the average case, because we have no a priori feeling for
how nodes would start on the average. However, it is clear that having every node start
simultaneously is a worst case. Thus, the Start-up Variant, on the average, is bounded above
by O(n log n).

3.2.4.  Reliability  Considerations 

Distributed  environments  are  often  unstable,  and  thus  it  is  important  to  consider  the 
behaviour  of  decentralized  algorithms  in  the  face  of  failures.  Modes  of  failure  can  be  com¬ 
plex  and  intractable  to  analysis.  For  example,  the  failure  of  a  link  may  produce  disjoint 
components  of  varying  sizes.  The  failure  of  several  nodes  in  succession  may  defeat  any  fail¬ 
safe  mechanism,  and  finally  failures  may  occur  in  the  detection  and  repair  mechanisms. 

We define a clean failure of a node to be one in which the node vanishes without causing
lost messages or disrupting communications links. Also, we will call an election by which a
node is elected when it gets its own ballot back a ballot election. Consider only the situation
in which nodes fail cleanly. If any node but node n fails, then ballot-n survives and
node n survives, so node n will be elected correctly. So only if node n fails are we in trouble.
If node n fails before it extinguishes ballot-(n - 1), then node (n - 1) will elect itself. In the
case in which node n extinguishes ballots from all other nodes, and then itself fails, we note
that ballot-n continues to circulate (otherwise node n would have been elected). Thus, if
none of the nodes elect themselves by ballot, we must be able to depend on ballot-n still
being intact. We must ask how we might detect that ballot election fails.

If any node is elected by ballot, no matter what the pattern of clean node failures, it will
happen in the first cycle of that ballot. Assume node n - 1 will be elected by ballot. All the
nodes between n and n - 1 must encounter ballot-n for a second time before ballot-n gets to
the newly elected node n - 1 and can be extinguished by it. Therefore, if any node were to
see ballot-n for a third time, it must mean that ballot election failed. This node can immediately
elect itself, and extinguish ballot-n. This fail-safe election mechanism is sure of making
a unique selection, but not necessarily choosing the largest of the surviving nodes, given
some pattern of node failure. However, whatever the pattern, one and only one node will be
elected.

Suppose,  however,  that  a  link  fails.  Then  the  circle  is  disrupted,  and  communications 
cannot  go  from  node  to  node.  We  introduce  techniques  in  Chapter  4  for  dealing  with  com¬ 
munications  and  elections  in  general  graphs.  For  the  algorithms  in  Chapter  3,  however,  a 
link  failure  implies  that  the  algorithm  must  be  suspended  until  communications  are  restored. 
A  positive  acknowledgment  message  protocol  will  allow  us  to  detect  such  link  failures.  We 
now  consider  node  failures  which  are  not  clean. 

A  node  which  fails  could  be  holding  a  message  at  the  time  of  failure.  Thus,  node  failures 
may  include  message  losses.  For  the  Basic  Extinction  Election,  many  messages  are  delibera¬ 
tely  extinguished.  However,  it  is  critical  that  the  ballot  from  the  largest  node  returns  to  its 
originator.  We  have  seen  how  failure  of  the  largest  node  without  message  loss  might  be 
handled.  If  the  largest  ballot  is  lost,  however,  all  other  messages  might  be  removed,  and  then 
no  node  is  elected.  To  cope  with  this  problem  requires  a  time-out  mechanism  at  each  node, 
which  could  send  a  query  around  the  circle.  If  the  query  returns  before  any  other  messages 
are seen, the node knows that the election was unsuccessful. It can then start another election.
This technique is similar to that used in LeLann’s election algorithm [LELA 77].

Algorithm  3.3.  The  Fail-safe  Variant 

This  algorithm  assumes  a  circular  configuration,  clean  node  failures,  and  an  intact  mes¬ 
sage  system.  It  is,  otherwise,  independent  of  the  number  and  order  of  clean  node  failures. 
Each  node  has  a  variable  big  and  a  variable  count.  Initially,  the  value  of  big  at  each  node  is 
the node name, and the value of count at each node is 1. Assume, for simplicity, that starting
is simultaneous. Consider the arrival of ballot-j at node i. In the text of the description,
the rules are to be considered in order, so that if rule 1 is false, then rule 2 is tried, and so
on.

1.  If  (;'  =j)  then  the  ballot  originated  from  node  ?,  and  the  node  is  elected. 


3-10 


2.  If (i > j) then extinguish ballot-j. Do nothing.

3.  If (j > i) and j > big then j replaces the previous value of big, and count is set to 1.
Send the ballot on.

4.  If (j > i) but j < big, the ballot can be extinguished since it cannot be from the largest
node. Do nothing.

5.  If (j > i) and j = big then add 1 to count. If count is 3, then node i is elected. If not,
send the ballot on.
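
The five rules can be sketched as a per-node handler. This is only an illustration under assumptions: the class name Node, the function name on_ballot, and the returned action strings are hypothetical, and the sketch covers a single node's reaction to one arriving ballot rather than the whole circle.

```python
class Node:
    """State kept by one node in the Fail-safe Variant (names are illustrative)."""

    def __init__(self, name):
        self.name = name
        self.big = name    # initially big is the node's own name
        self.count = 1     # initially count is 1


def on_ballot(node, j):
    """Apply rules 1-5, in order, to ballot-j arriving at `node`.

    Returns one of the hypothetical action strings
    "elected", "extinguish" or "forward".
    """
    i = node.name
    if i == j:                        # rule 1: own ballot has returned
        return "elected"
    if i > j:                         # rule 2: ballot from a smaller node
        return "extinguish"
    # From here on j > i.
    if j > node.big:                  # rule 3: new largest ballot seen
        node.big, node.count = j, 1
        return "forward"
    if j < node.big:                  # rule 4: cannot be from the largest node
        return "extinguish"
    node.count += 1                   # rule 5: j == big, seen once more
    return "elected" if node.count == 3 else "forward"
```

A ballot whose originator has failed is never removed by rule 1, so it keeps circulating; under this reading, a node that forwards the same largest ballot twice and then sees it a third time concludes that the originator is gone and takes over the election.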

Given  these  conditions  of  failure,  then,  we  have  established  a  mechanism  which  will 
satisfy 

Proposition  3.3.  The  Fail-safe  Variant  will  elect  one  and  only  one  node  unless  all  nodes  fail. 

□ 

3.2.5.  The  Multiple-Copy  Update  Problem 

We  have  reviewed  decentralized  algorithms  for  the  multiple-copy  update  problem  in 
Chapter  2.  Briefly,  if  copies  of  a  heavily-used  file  are  kept  at  all  the  nodes  of  a  network,  and 
updates  originating  from  any  of  the  nodes  need  to  be  made  as  they  occur,  then  the  problem 
consists  of  coordinating  the  updates  such  that  the  file  copies  are  consistent.  Consistency  has 
a  particular  meaning  in  this  context.  Given  a  set  of  base  variables  in  a  file  F,  with  multiple 
copies  of  F  at  many  nodes,  then  a  mechanism  M  updates  file  F  consistently  if  the  effect  of 
simultaneous  update  requests  on  the  common  base  variables,  originating  at  different  nodes, 
is  the  same  as  if  these  requests  occurred  sequentially,  with  each  copy  being  updated  in  the 
same  sequence. 

This  problem  is  fundamentally  one  of  inducing  a  total  ordering  on  the  events  needed  to 
accomplish  all  the  updates.  Simultaneous  updates  are  contending  requests  for  the  resource, 
and  must  be  granted  one  at  a  time.  Our  approach  is  based  on  election  as  a  decentralized 
mutual  exclusion  mechanism.  If  several  requests  were  simultaneously  contending  for  access 
to  the  distributed  resource,  election  ensures  that  only  one  is  allowed  to  use  it.  Requests 
which  occur  after  an  election  is  in  progress  must  wait  for  the  next  election.  Consistency 
does  not  require  that  updates  be  applied  strictly  in  the  order  of  their  arrival,  only  that  they 
be applied in the same sequence for each copy. Thus, if we use a clock which is roughly
accurate at each node, together with an intrinsic node number, we can generate unique
update priorities at each node which can be used for election. This method will approximate
the real ordering of update requests. An arbitrary ordering which does not favour any
particular node can be achieved by using a random number generator in place of a clock.
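
One way to build such priorities is sketched below. The function names, the tuple representation, and the negation of the clock reading (so that an earlier request compares as larger) are assumptions made for illustration, not details given in the text.

```python
import random


def make_priority(clock_reading, node_number):
    """Priority from a roughly accurate local clock plus the node number.

    Tuples compare lexicographically, so the node number breaks ties
    between equal clock readings and keeps priorities unique. The clock
    reading is negated so that an earlier request has a larger priority.
    """
    return (-clock_reading, node_number)


def make_random_priority(node_number, rng=random):
    """Arbitrary ordering that favours no particular node: a random draw
    replaces the clock, and the node number again guarantees uniqueness."""
    return (rng.random(), node_number)
```

For example, make_priority(100, 3) compares greater than make_priority(105, 7), so the request stamped earlier would win an election between the two.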

The basic idea is that an election is held only among those nodes which wish to update.
From a neutral state, any node which spontaneously sends out a ballot is in the current
election. However, a node which has not already joined the election is shutoff by the arrival
of a ballot, and must wait till the next one. We recall that we are still considering a circular
configuration. The Multiple-Copy Algorithm is a simple version of the Start-up Variant.

3.2.5.1.  Multiple-Copy Algorithm

Let all nodes initially be asleep, and let all nodes which spontaneously wish to update in
the asleep state turn themselves awake, and send out their ballots. Now consider ballot-j
from node j, with priority $p_j$, arriving at node i, which has priority $p_i$.


1.  If node i is asleep, set it to shutoff. If node i wishes to update after this, it puts the
update in a queue. Forward ballot-j.

2.  If node i is shutoff then just forward the ballot.

3.  If ($p_j > p_i$) then forward ballot-j.

4.  If ($p_j < p_i$) then do nothing.

5.  If ($p_j = p_i$) then node i is elected. The elected node circulates its update to all other
nodes. The update is performed at each node, which passes the update message on,
and then enters the asleep state. When the update returns to the elected node, that node
makes the update itself, discards the update message, and enters the asleep state. A node
which is asleep and has an entry in its update queue awakes and starts an election.
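
The election part of these rules (rules 1 to 4, and the test in rule 5) can be simulated on a small circle. The sketch below rests on several assumptions that are not in the text: nodes are numbered 0 to n - 1, messages travel from node i to node (i + 1) mod n, all contenders start simultaneously, and a single FIFO queue stands in for the message system. The function name elect is hypothetical.

```python
from collections import deque


def elect(n, priorities):
    """Simulate one election among the contenders in `priorities`.

    n          -- number of nodes in the circle (numbered 0 .. n-1)
    priorities -- dict mapping each contending node to its unique priority
    Returns the elected node, or None if nobody contends.
    """
    # Contenders are awake; every other node starts asleep.
    state = ["awake" if i in priorities else "asleep" for i in range(n)]
    # Each in-flight ballot is (node it is arriving at, priority it carries).
    inflight = deque(((i + 1) % n, p) for i, p in priorities.items())
    while inflight:
        at, p = inflight.popleft()
        if state[at] == "asleep":            # rule 1: shut off, then forward
            state[at] = "shutoff"
            inflight.append(((at + 1) % n, p))
        elif state[at] == "shutoff":         # rule 2: just forward
            inflight.append(((at + 1) % n, p))
        elif p > priorities[at]:             # rule 3: larger ballot passes
            inflight.append(((at + 1) % n, p))
        elif p < priorities[at]:             # rule 4: smaller ballot dies here
            continue
        else:                                # rule 5: own ballot came back
            return at
    return None
```

With a circle of six nodes and contenders at nodes 1, 4 and 5 holding priorities 10, 30 and 20, the call elect(6, {1: 10, 4: 30, 5: 20}) returns node 4, the contender with the largest priority.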

As long as our method of assigning priorities produces unique values, in which the priorities
of earlier requests are larger, starvation will not occur, since the algorithm always
elects the contender with the highest priority. Thus, roughly accurate clocks will suffice.

3.2.5.2.  Sequenced Updates

If  there  are  several  processes  involved  in  an  election,  then  any  update  requests  which 
arrive  before  all  of  these  have  completed  will,  in  general,  have  to  wait  till  they  are  all  done. 
This  means  that  in  most  cases,  rather  than  electing  a  single  node  many  times,  it  is  better  to 
produce  an  ordering  of  competing  nodes  in  every  election.  The  node  with  highest  rank  will 
update  first,  then  the  next,  and  so  on.  The  last  node  will  issue  the  clear  message,  which  will 
permit  the  next  round  of  elections.  Note  that  if  timestamps  are  used  to  rank  nodes,  then  the 
lowest  timestamp  will  have  the  highest  priority.  We  continue  to  use,  for  simplicity,  the  name 
of  a  node  for  its  rank.  However,  it  would  be  quite  easy  to  use  some  other  ranking  function, 
such  as  time-stamp,  or  time-stamp  plus  name,  or  some  other  assigned  priority. 

The sequenced update is more efficient in terms of total messages passed, since fewer
elections are required. It can be easily implemented by finding for each node i the value of
its variable pair (high, low). The first would be the next node higher than i among the
contenders, while the other would point to the next lower. In this way, a distributed doubly
linked list is produced by the algorithm. Initially, every node has a value of TOP for high
and a value of BOTTOM for low, where TOP is an implementation dependent largest number,
and BOTTOM a corresponding smallest number. Using this mechanism, the highest and
lowest nodes among all contenders would be able to identify themselves as such, for the
highest would find no higher node, and its high field would be unchanged. A similar
argument holds for the lowest node.

In this variation, each node would have to see all the identities of the other contenders,
so extinction would not be used. Instead, all ballots must go around, and the node which is
elected highest is that node which, upon getting its ballot back, finds that its high is
unchanged from TOP. Similarly, the smallest node in the ordering of contenders would
have its low unchanged from BOTTOM. To ensure that the highest node is indeed highest,
instead of hoping that by the time its own ballot comes back the ballots from all other nodes
have visited it, we take a more positive approach. The ballot from a node carries its name,
and is sure to visit all other nodes before it returns to its originator. Thus, we let the ballot
carry the information as to what the next larger and smaller nodes of its originator are.
Then when the ballot returns, the originator is assured that all the nodes have been taken
into account.


Algorithm  3.4.  Sequenced  Update  Variant 

Let each node initially be asleep. When a node wishes to update, it turns itself to awake,
sends a ballot out with high of value TOP, and low of BOTTOM. The ballot, of course, also
carries its node identity. A sleeping node is shutoff by a ballot. In that case, if it subsequently
wishes to update, it puts the request in a queue and tries again when asleep. Consider
a ballot-j arriving at node i.

1.  If  node  i  is  asleep  then  become  shutoff.  Subsequent  update  requests  must  wait  for  a 
new  election.  Forward  the  ballot. 

2.  If  node  i  is  shutoff  just  forward  the  ballot. 

3.  If (i = j) and the high field on the ballot is still TOP, then node i begins an update.

4.  If (i ≠ j), then if i > j and i < high in ballot-j, replace the value of high by i. If i < j but
i > low of ballot-j, replace the value of low by i. Send the ballot on.

5.  If (i = j), but the high field on the ballot is not TOP, the node just waits for its turn to
update.

6.  A node that is allowed to update sends an update message around the circle to all
other nodes. When the update returns, it then checks the low field in its ballot. If this
value is not BOTTOM, it transfers update control by a special message to the node
named there. If it is BOTTOM, then there are no more updates in this cycle.

7.  A  node  which  receives  an  update  message  will  perform  the  update,  pass  it  on,  and  then 
enter  the  asleep  state.  If  there  is  an  update  in  its  queue,  it  will  then  awaken. 
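
The way the ballots build the distributed list can be sketched as follows. The sketch makes assumptions that go beyond the text: nodes are numbered 0 to n - 1 and node names serve as ranks, all contenders are awake before any ballot arrives, infinities stand in for TOP and BOTTOM, and the function names are hypothetical. Only rule 4 and the hand-over of rule 6 are modelled.

```python
TOP = float("inf")       # stands in for the implementation dependent largest number
BOTTOM = float("-inf")   # stands in for the corresponding smallest number


def sequence(n, contenders):
    """Send each contender's ballot once around a circle of n nodes.

    Returns a dict mapping each contender j to the (high, low) pair its
    ballot carries when it returns to j.
    """
    links = {}
    for j in contenders:
        high, low = TOP, BOTTOM
        at = (j + 1) % n
        while at != j:                     # visit every other node once
            i = at
            if i in contenders:            # only awake nodes compare (rule 4)
                if i > j and i < high:
                    high = i               # closer higher neighbour found
                elif i < j and i > low:
                    low = i                # closer lower neighbour found
            at = (at + 1) % n
        links[j] = (high, low)
    return links


def update_order(links):
    """Follow the low fields from the highest contender down (rule 6)."""
    order = []
    node = next(j for j, (high, _) in links.items() if high == TOP)
    while node != BOTTOM:
        order.append(node)
        node = links[node][1]
    return order
```

For contenders 1, 4 and 5 in a circle of six nodes, the returned ballots carry (4, BOTTOM), (5, 1) and (TOP, 4) respectively, so the updates proceed in the order 5, 4, 1.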

In this algorithm, whether i and j represent priority values of nodes, or node names, all
contending update requests will be serviced. Hence, starvation is not possible.

3.2.5.3.  Behaviour 

Let  us  consider  the  two  approaches  of  the  extinction  election  and  the  sequenced  update 
variant  in  terms  of  elapsed  time,  message  passes  and  number  of  comparisons.  Call  the  first, 
electing  only  one  candidate  each  time,  Algorithm  A,  and  the  other,  sequencing  many  in  one 
election,  Algorithm  B.  Algorithm  A  requires  k  elections  in  order  to  service  k  contending 
updates,  whereas  Algorithm  B  uses  one  execution  to  sequence  k  candidates,  following  which 
they  perform  their  updates. 

First consider elapsed time. Each update of n files takes, in sequence from an elected
node, at least time n. For k updates, then, no matter which algorithm is used, the updates
themselves take time kn. Between the two algorithms, however, we note that Algorithm A
requires k elections, each needing at least time n and at most time 2n. Thus, the time
attributable to Algorithm A itself is at most 2nk and at least nk. For Algorithm B, however,
sequencing k contending updates takes only approximately time n.

Next consider message passes. For Algorithm A, if we elect one candidate from k nodes
in a circle of k nodes, the average number of message passes is $k H_k$, where $H_k$ is the kth
harmonic number. $H_k$ is approximately $C_e + \ln k$, where $C_e$ is Euler’s constant, and $\ln k$ is
the natural logarithm of k. However, each message pass is actually, on the average, $n/k$
edges in length. This assumes that the expected distribution of k contenders is uniform
within a circle of n nodes. This leads to an average number of message passes of
$k H_k \times n/k$, which is $n H_k$.

For the 1st election from k nodes, we have $n H_k$;
for the 2nd election from k - 1, we have $n H_{k-1}$;
for the 3rd election from k - 2, we have $n H_{k-2}$;
etc.

Thus, for all k elections, assuming no new nodes enter the elections, we have:

\begin{align*}
n (H_k + H_{k-1} + H_{k-2} + \cdots) &= n (C_e + \ln k + C_e + \ln (k-1) + \cdots) \\
&= nkC_e + n (\ln k + \ln (k-1) + \cdots) \\
&= nkC_e + n \ln (k!) \\
&\approx nkC_e + nk \ln k = nk H_k \\
&\sim O(nk \log_2 k)
\end{align*}

For Algorithm B, there are k ballots, and each goes around the circle of n nodes. Therefore,
the number of message passes from ballots is nk. The transfer of update control also has
to pass from the largest of the contenders, through all k of them, after each performs its
updates. A simple mechanism which would perform this function is just to send a message
around the circle, which stops at the addressed node. On the average, the k nodes would be
distributed evenly about the circle of n nodes. However, the order in which updates would
occur is not necessarily in a convenient sequence. Hence, on average, each such message
goes $n/2$ edges. Thus, they contribute $nk/2$ message passes, and the total expected number
of message passes is $3nk/2$.
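
To make the comparison concrete, the sketch below evaluates both expected message counts for given n and k. It sums the per-election figure $n H_j$ over j = k, k - 1, ..., 1 for Algorithm A and uses $3nk/2$ for Algorithm B, as in the text; the function names and the use of exact rational arithmetic are illustrative choices.

```python
from fractions import Fraction


def harmonic(k):
    """Exact k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(Fraction(1, j) for j in range(1, k + 1))


def messages_a(n, k):
    """Expected message passes for k successive elections (Algorithm A):
    n * (H_k + H_{k-1} + ... + H_1)."""
    return n * sum(harmonic(j) for j in range(1, k + 1))


def messages_a_bound(n, k):
    """The simpler estimate n * k * H_k used in the text."""
    return n * k * harmonic(k)


def messages_b(n, k):
    """Expected message passes for one sequencing (Algorithm B): 3nk/2."""
    return Fraction(3 * n * k, 2)
```

For n = 12 and k = 4 the exact sum for Algorithm A is 77 message passes, the estimate $nk H_k$ gives 100, and Algorithm B needs 72.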

In  considering  the  number  of  comparisons,  we  point  out  that  this  is  perhaps  the  least 
important  metric  for  distributed  systems  using  decentralized  algorithms,  while  the  most 
important  may  be  elapsed  time.  However,  comparisons  have  been  a  standard  measure  of 
complexity of algorithms. Let us carefully define what we count. A ballot arriving at a
contending node involves one comparison between the rank of the node and that of the ballot.
However, at every node, a ballot has to ascertain whether or not the node is involved. In
reality, therefore, the total number of comparisons comes from two activities: during
message passing, and between ballots and candidate nodes. For the first election in Algorithm
A, there are $k H_k$ ballot comparisons at candidate nodes, and there are $n H_k$ comparisons
during message passes. For the second election, there are $(k-1) H_{k-1}$ ballot comparisons,
and $n H_{k-1}$ message pass comparisons.

This is very similar to the computation for message passes. In fact, for comparisons
from message passes, it is simply one comparison per message pass, which is a total of
$nk H_k$. For ballot comparisons, the total number is:

\begin{align*}
& k H_k + (k-1) H_{k-1} + (k-2) H_{k-2} + \cdots \\
&= k H_k + k H_{k-1} + k H_{k-2} + \cdots - (H_{k-1} + H_{k-2} + \cdots)
\end{align*}

This simplifies to

\[ k^2 H_k - k H_{k-1} \]

Adding to this the term from message passes, we get total comparisons as

\begin{align*}
nk H_k + k^2 H_k - k H_k &\approx (nk + k^2) H_k \\
&\sim (nk + k^2) \log_2 k = O(nk \log k)
\end{align*}

For Algorithm B, there are k ballots, each of which goes n edges for message passes, requiring
nk comparisons. In addition, at each candidate node, two comparisons are needed,
which is another $2k^2$ comparisons. The total is therefore $nk + 2k^2$ compares for k candidates
in a circle of n nodes.

Algorithm  A  uses  k  elections  to  allow  k  contenders  to  update,  whereas  Algorithm  B  uses 
one  sequencing  of  k  updates.  The  table  below  summarizes  the  differences  in  their 
behaviours. 

                     elapsed time        message passes    compares
                     (election only)

Algorithm A          nk to 2nk           nk H_k            (k^2 + nk) H_k
(k elections)

Algorithm B          ~ n                 ~ 3nk/2           2k^2 + nk
(one sequencing
of k updates)

3.2.5.6.  Correctness  Arguments 

Consider  Algorithm  A.  This  is  simply  a  version  of  the  Start-up  Variant,  except  that 
instead  of  using  each  and  every  node  in  the  election,  only  those  which  got  their  bids  in 
before  any  ballot  gets  to  them  would  be  included.  There  is  otherwise  no  difference.  Hence, 
this  is  logically  an  election  among  i  nodes,  and  one  and  only  one  node  is  sure  to  be  elected. 

For Algorithm B, we have a variant of LeLann’s algorithm. We need only show that for
every node in the election, its ballot will visit all nodes in the election. This follows immediately
from the circular configuration of nodes and the fact that a node, once shutoff, cannot
enter the current election, but must wait until the next one.

3.2.6.  Discussion 

We will not discuss reliability issues for the multiple-copy update variants of the basic
extinction algorithm except to say that the fail-safe mechanism can be applied to the algorithm
which elects one from k contenders. However, the sequenced update version depends
on all nodes being intact to move down the list of updating nodes. This makes it more
susceptible to failure.

The purpose of the k-sequencing algorithm is to produce a distributed list of k contenders.
We will treat this aspect first in the face of failures. Every contender sends a message
around to find its neighbours in the k-ordering. Assume that the loss of a node is permanent,
for otherwise we only need wait for it to recover and resume the algorithm. Furthermore, if
a line failed permanently, then the circle is broken, and the algorithm can no longer apply to
this new graph. Hence, we only consider permanent node failures.

Firstly, if a contending node fails, its own message will not be removed. Secondly, it
may incur the loss of one or more messages with its failure. Thirdly, it may already have
been recorded as the successor of another contender node. The first problem can be dealt
with by modifying the algorithm so that the message from node i, arriving for the second time
at a node j which is its next bigger or smaller in the k-ordering, will be checked to see if j
is already in the corresponding field of the message. If so, the message is destroyed by node j.

If  the  largest  node  is  the  node  which  failed,  then  the  updates  never  get  started,  unless  the 
condition  is  detected.  Any  node  i  can  identify  the  message  of  the  largest  contender,  first  by 
keeping  track  of  the  largest  originator  of  messages  seen,  and  then  by  confirming  that  the 
largest  such  message  seen  for  the  second  time  is  the  highest  contender  since  its  high  field 
contains  the  number  TOP.  When  this  situation  arises,  the  detecting  node  can  remove  the 
message,  and  send  the  control  to  update  to  the  node  named  in  the  low  field  of  the  message. 

The question of message loss must be dealt with by a time-out mechanism, following
which the detecting node sends out a query, which can be disregarded if the expected message
arrives before the query returns. This technique is adapted from LeLann [LELA 77]. If
the query returns before the expected message, then the message is considered lost, and a
replacement message can be issued.

In the final case, if a node which fails has been recorded as the next lower of an intact
contender, then the chain of updates will be broken. Furthermore, the updates cannot continue
unless the node which has completed its own update, on finding it cannot transfer control
to a lost node i, sends around a special message to find the largest contender smaller
than i, and then transfers control to it.

The  problems  of  a  node  failing  while  a  distributed  update  from  one  node  is  in  progress, 
and  of  how  the  failed  node  can  do  the  update  once  it  recovers,  are  not  problems  unique  to 
our decentralized algorithms. They have been addressed in other research, for example by
Lampson and Sturgis [LAMPS 76].

The algorithms which use the principle of message extinction have the curious property
that unidirectional messages are better than messages sent out in both directions. In the
basic extinction algorithm, consider ballot-(n - 1). If this was sent around in both directions,
clearly both ballots would have to be extinguished by node n. Therefore, instead of an
average of n/2 message passes, we would need n message passes! Furthermore, there is no
saving in elapsed time, for although ballots less than n can be extinguished in time n/2 on
the average, the determining factor is the time needed for ballot n to get back to itself.
Whether unidirectional or bidirectional, ballot n needs time n to traverse all nodes and
return, in order for n to be elected.

In Chapter 4, we will consider the extension of these ideas concerning mutual exclusion
for distributed systems to the general case of processors which are interconnected in arbitrary
graph configurations. We will introduce a technique for parallel graph traversal, and apply
it to create election algorithms for the general case. This technique will be shown to be a
powerful basis for many other decentralized algorithms. Having described the extended
election algorithms, we will then make some comparisons between them and other solutions
to the multiple copy update problem.


Chapter 4


Echo  Algorithms 


4.1.  Introduction 

The technique of depth-first search is well known in graph algorithms [AHO 74]. For
example, the algorithms for finding the strong components of a directed graph [TARJ 72], or
the biconnected components of a general graph [AHO 74], are based on this method.
However, depth-first search is a technique which assumes that only one operation at a time
is performed on a graph. We ask whether, given the possibility of parallel graph operations,
there is a parallel traversal method which is analogous to depth-first search. In theory, such
a traversal method could cover the graph in time linearly proportional to the diameter
of the graph, the longest of the shortest paths between all pairs of nodes. A sequential
algorithm can do no better than visit each node in turn. Furthermore, if there is a procedure
which can logically be separated into independent parts, it is reasonable to execute these
parts simultaneously, rather than sequentially.

We will present a class of algorithms for detecting properties of general graphs which
are based on the distributed system we introduced as Model 2 in Chapter 1. To recapitulate,
each node of the graph is an independent processor which interacts with its neighbours
through messages. No global storage or controller is used. There is furthermore no need for
any node to know the extent or membership of the entire graph. Rather, the graph exists in
distributed form through each node’s awareness of its adjacency relationships. These algorithms
are decentralized, and operate through the cooperative behaviour of the nodes. They
are based on message passing, and use a parallel graph traversal technique which takes advantage
of potential simultaneous activity.

As we present each algorithm, we will indicate its relevance to the control requirements
of a distributed system. The decentralized nature of our method makes these algorithms
quite different from other parallel models which have been proposed for finding graph
properties. For example, a parallel depth-first method in the literature is based on k-processors
sharing common memory [ECKS 77]. The studies of parallelism for graph algorithms by
Arjomandi [ARJO 78] also assume k-processors and common memory. Rosenstiehl’s
[ROSE 72] distributed algorithms based on a network of finite state machines are very close
in concept to ours, but assume a synchronous system with simultaneous transitions based on
sensing the states of all neighbours at each step.

Fully parallel algorithms on graphs must solve some basic problems. If several edges
lead to one node, and the parallel traversals of edges starting from some initial node should
arrive at that node simultaneously, how is this to be handled? Does the message from each
of the edges get passed on, and if not, what is to be done with the ones which are aborted?
How does information get back to the starting node in a coordinated fashion? We shall show
that the class of parallel graph algorithms which we call Echo Algorithms addresses these
problems in simple and efficient ways.

4.2.  Echo Algorithms

The basic ideas behind echo algorithms are simple, and will be described informally in
this section. Given a general graph with intelligent nodes which can communicate along its
edges, the first idea is that message passing is the fundamental operation of any echo algorithm.
Traversal of the graph therefore means passing messages from one node to another.
For any particular node i which starts the execution of an echo algorithm, the messages
originating from i form a family, sharing the identity of i in common.

The second idea is that there are two phases in the traversal of a graph: a forward phase
and an echo phase. The forward traversal of a graph from a starting node is accomplished
by explorers, and the echo phase by echos. Let us confine ourselves at this point to
single-source echo algorithms, those which are started by one node, so that we can study the
behaviour of one family of explorers and echos.

The third idea, then, is that each node which is visited for the first time by an explorer
will propagate explorers in parallel along all of the out-edges of that node. For a connected
undirected graph, these would be all edges except the edge at which the first explorer
arrived. This edge is called the first edge. Explorers coming to a visited node will turn into
echos, as will explorers coming to a sink node, one which has no other edges. Echos travel
in directions opposite to that of explorers.

The fourth principle is that of synchronization. A node will echo on its first edge after it
receives an echo for each explorer sent out. This is called the echo-merge mechanism. We
assume that there is an arbiter mechanism at each node, such that if messages should arrive
simultaneously, they are given some arbitrary sequential ordering.

The last idea involved in echo algorithms is that explorers and echos carry information
with them about those parts of the graph which they have traversed. A node which synchronizes
echos will process this information, and send the result along on the echo from
that node. The starting node will finally receive all its echos from its out-edges, and after
processing their information, obtain the result of the algorithm.

An  echo  algorithm,  then,  is  started  by  some  initiating  node  sending  out  in  parallel  as 
many  explorers  as  there  are  out-edges,  each  one  carrying  the  identity  of  the  starting  node. 
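
The informal description above can be turned into a small simulation for a connected undirected graph. The sketch is an illustration under assumptions rather than the algorithm as formally stated: the graph is given as an adjacency dictionary without self-loops or parallel edges, a single FIFO queue stands in for asynchronous message delivery and for the arbiter, and the function name echo_traversal and the message tuples are hypothetical.

```python
from collections import deque


def echo_traversal(adj, s):
    """Run a single-source echo traversal from initiator s.

    adj -- dict mapping each node to a list of its neighbours (undirected)
    Returns (first, done, sent): the first edge recorded at each node,
    whether the initiator received all its echos, and the message count.
    """
    first = {s: None}             # the initiator has no first edge
    pending = {s: len(adj[s])}    # echos each visited node still awaits
    queue = deque(("explorer", s, v) for v in adj[s])
    sent = len(queue)
    while queue:
        kind, src, dst = queue.popleft()
        replies = []
        if kind == "explorer" and dst not in first:
            # First visit: record the first edge, explore all other edges.
            first[dst] = src
            outs = [w for w in adj[dst] if w != src]
            pending[dst] = len(outs)
            replies = [("explorer", dst, w) for w in outs]
            if not outs:          # sink node: the explorer turns into an echo
                replies = [("echo", dst, src)]
        elif kind == "explorer":
            # Explorer reaching an already visited node turns into an echo.
            replies = [("echo", dst, src)]
        else:
            # Echo-merge: echo on the first edge once every explorer is answered.
            pending[dst] -= 1
            if pending[dst] == 0 and first[dst] is not None:
                replies = [("echo", dst, first[dst])]
        queue.extend(replies)
        sent += len(replies)
    return first, pending[s] == 0, sent
```

On the triangle with nodes 0, 1 and 2 and initiator 0, both other nodes record the edge from 0 as their first edge, and the traversal completes after 8 messages: 4 explorers and 4 echos.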

4.2.1.  Definitions 

Given a graph G = <V,E> where V is a non-empty set of nodes and E is a set of edges
of the form (x.y) where x and y are members of V, let n be the cardinality of V, and e the
cardinality of E. We distinguish several classes of graphs. All connected undirected graphs
are called C-graphs. For directed graphs, there are several possible sub-graph relationships.
A directed graph G is type I, or digraph I, if it is a strongly connected graph, in which every
node is reachable from every other node. We call the reach set of a node v the set of nodes
in G to which there exists a directed path from v. Then a digraph II is a directed graph
whose reach sets are dissimilar, but one of which is the entire set of nodes of the graph. A
digraph III is a directed graph which has no reach set containing all the nodes of the graph.
Thus, there exists no node from which all other nodes can be reached. Figure 4.1 illustrates
these different classes of graphs. A sink node in a C-graph is a node with only one edge,
while for a digraph, it is a node with no edges directed out from it.
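
The three digraph classes can be checked mechanically from the reach sets. The sketch below is illustrative only: the function names and the adjacency-dictionary representation are assumptions, and a node is counted as reaching itself when deciding whether its reach set covers the whole graph, which is the convention Figure 4.1 appears to use.

```python
def reach_set(adj, v):
    """Nodes reachable from v by a directed path, with v itself included."""
    seen = {v}
    stack = [v]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def classify(adj):
    """Return "I", "II" or "III" for a directed graph given as
    a dict mapping each node to the list of nodes its out-edges lead to."""
    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    covers_all = [reach_set(adj, v) == nodes for v in nodes]
    if all(covers_all):
        return "I"      # strongly connected
    if any(covers_all):
        return "II"     # some node reaches the entire graph
    return "III"        # no node reaches the entire graph
```

A directed chain from a through b and c to d is classified as a digraph II, a directed cycle as a digraph I, and two sources feeding a common sink as a digraph III.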

[Figure 4.1  Some Graphs. Panels: a C-graph; a Digraph I (the reach set of every node is the set of nodes of the graph); a Digraph II (the reach set of a is the entire graph, since it reaches itself; the reach set of a is {b,c,d}, of b is {c,d}, of c is {d}, and of d is empty); a Digraph III (no node has a reach set which is the entire graph).]

An initiator node S is a member of V which produces, in an execution of an echo algorithm,
a family of explorer and echo messages. Let explorer[a.b] represent the explorer going
from node a to node b. Then if α represents an arbitrary node, explorer[α.b] is any explorer
coming to node b, and explorer[a.α] would be any explorer leaving node a. We
adopt the same convention for echos, and use {} to represent a set in the usual manner, with
a suffix -S to indicate the initiator. Thus {echo[α.b]}-S would refer to all the echos going to
node b belonging to the family of messages of initiator S.

By [a.b] we will mean the edge going from a to node b, and by [b.a] we will mean the
same edge but in the sense of b to a. To convey a neutral sense of an edge connecting a and
b, we use (a.b).

An explorer which is the first to arrive at a node is called a primary explorer. An edge
carrying  a  primary  explorer  is  a  P-edge.  A  node  may  have  several  P-edges,  but  the  P-edge 
on  which  it  was  first  visited  is  called  its  first  edge.  Its  other  P-edges  are  first  edges  to  their 
successor  nodes.  Non-P  edges  clearly  carry  explorers  to  already  visited  nodes.  Every  node 
has  a  first  edge  except  the  initiator,  which  is  considered  visited  a  priori.  A  node  which  has 
P-edges  leading  out  of  it  is  called  a  P-node,  and  a  node  which  does  not  is  a  non-P  node. 

Echos  arise  in  two  situations:  at  the  termination  of  an  explorer,  and  from  a  node  which  has 
received  an  echo  for  every  explorer  it  has  sent  out,  and  then  itself  echos  on  its  first  edge.  In 
the  first  case,  such  an  echo  is  called  an  initial  echo,  and  the  node  at  which  its  corresponding 
explorers  terminated  is  called  the  origin  of  the  initial  echo. 

4.2.2.  General  Properties 

In  order  to  elicit  some  properties  general  to  most  of  the  echo  algorithms,  let  us  first 
describe  the  pure  traversal  algorithm.  This  will  also  establish  a  prototype  for  the  description 
of  other  algorithms  to  follow. 

Algorithm 4.0 is a traversal of a graph from an initiator node, and we cannot traverse
nodes which are not reachable. Hence, for a digraph, we can only study the subgraph G'
induced from G by the reach set of the initiator node S in G. Call this the S-reach graph of
the original graph G. Therefore Algorithm 4.0 can apply to any C-graph or digraph I, and
to the S-reach graphs of digraph II and digraph III. Figure 4.2 illustrates these different
types of graphs.

Explorers  and  echos  represent  messages  of  two  types  going  from  node  to  node.  The  basic 
identity  of  each  is  thus  type  and  family  name  S.  Implicit  to  the  message  is  the  TO  and 
FROM  node  information,  and  other  protocol  which  the  communications  system  may  re¬ 
quire.  These  are  constant,  and  we  include  them  under  the  notion  of  basic  identity.  Algorithm 
4.0  requires  no  more  than  basic  identity  on  a  message. 

Algorithm  4.0.  Pure  Traversal 

First, assume that initiator S sends explorers in parallel on all its out-edges, where an
out-edge is a directed edge from S for digraphs, and all the edges of S for a C-graph. We
must consider the activity at each node for the arrival of an explorer or an echo on a
particular edge.

Figure 4.2  Some reach graphs (panels: a digraph II G; the b-reach graph of G; a digraph III G'; the d-reach graph of G')


1.  If the explorer is the first to arrive at the node, mark the edge as first, and send
explorers in parallel from the node on the out-edges of the node. For a digraph, these
are edges directed from the node. For a C-graph, these are all edges except the first
edge.

2.  If  the  explorer  is  not  the  first,  or  if  there  are  no  out-edges,  then  echo  back  along  the 
edge  on  which  the  explorer  arrived. 

3.  If  an  echo  comes  to  the  node,  then  mark  the  edge  as  having  received  an  echo.  If  all 
echos  for  the  node  have  arrived,  then  send  an  echo  back  along  the  first  edge  of  the 
node,  unless  the  node  is  the  initiator,  in  which  case  we  are  finished. 
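
The three rules above can be sketched as a small event-driven simulation. This is only an illustration for C-graphs under stated assumptions, not part of the original algorithm statement: the function name pure_traversal, the adjacency-dictionary representation of the graph, and the delay parameter (travel time of a message on an edge) are all hypothetical choices made for the example.

```python
import heapq

def pure_traversal(graph, s, delay=lambda a, b: 1):
    """Simulate Algorithm 4.0 on a connected undirected graph (assumed input).

    graph maps each node to a list of its neighbours; s is the initiator.
    Returns (first, counts, finished): the first edge of every node, the
    number of explorers and echos sent, and whether the initiator finished.
    """
    first = {s: None}                 # initiator is visited a priori
    pending = {}                      # echos still awaited at each node
    counts = {"explorer": 0, "echo": 0}
    events = []                       # (arrival time, seq, kind, from, to)
    seq = 0
    finished = False

    def send(t, kind, a, b):
        nonlocal seq
        counts[kind] += 1
        heapq.heappush(events, (t + delay(a, b), seq, kind, a, b))
        seq += 1

    def fan_out(t, node):
        # Out-edges of a C-graph node: every edge except its first edge.
        targets = [v for v in graph[node] if v != first[node]]
        pending[node] = len(targets)
        for v in targets:
            send(t, "explorer", node, v)
        return targets

    fan_out(0, s)
    while events:
        t, _, kind, a, b = heapq.heappop(events)
        if kind == "explorer":
            if b not in first:                # rule 1: primary explorer
                first[b] = a
                if not fan_out(t, b):         # rule 2: no out-edges (sink)
                    send(t, "echo", b, a)
            else:                             # rule 2: node already visited
                send(t, "echo", b, a)
        else:                                 # rule 3: an echo arrives at b
            pending[b] -= 1
            if pending[b] == 0:
                if b == s:
                    finished = True
                else:
                    send(t, "echo", b, first[b])
    return first, counts, finished

# A square 1-2-4-3 with the diagonal 2-3 (5 edges), initiated at node 1.
g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
first, counts, finished = pure_traversal(g, 1)
print(first, counts, finished)
```

With unit delays this example sends 7 explorers and 7 echos, 14 message passes in all, which is within the bound of 4e = 20 discussed in the performance section.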

Let  us  look  at  the  properties  of  the  pure  traversal  algorithm.  These  are  based  on  the 
three  fundamental  mechanisms  of  echo  algorithms:  explorers  are  sent  in  parallel  from  a 
node,  each  node  has  only  one  edge  on  which  it  is  first  visited  by  an  explorer,  and  a  node 
waits  for  all  its  echos  to  come  back  before  it  itself  echos  on  its  primary  edge. 

Property  4.1.  Each  node  receives  at  least  one  explorer. 

Argument: By assumption, only those graphs and sub-graphs in which all nodes are
reachable from S are in question. Therefore every node has a path from S. If any node did not
receive an explorer, then its predecessor on the path from S could not have received one. By
induction, either S did not emit any, or else the node is not reachable. In any case,
presuming no loss of explorers, a contradiction arises. Hence there is no such node. □

Property  4.2.  Eventually,  all  explorer  activity  will  terminate. 

Argument:  We  are  only  concerned  with  finite  graphs.  By  Property  4.1,  every  node  will  even¬ 
tually  be  visited,  and  any  explorers  generated  thereafter  can  only  come  to  sink  nodes  or 
visited  nodes,  turning  into  echos.  □ 

Property  4.3.  There  exist  non-P  nodes,  which  have  no  P-edges  leading  from  them. 

Argument:  We  are  referring  to  edges  which  are  first  edges  to  their  successor  nodes.  Trivi¬ 
ally,  sink  nodes  have  no  out-edges.  Furthermore,  by  Property  4.1,  all  nodes  eventually  get 
visited  for  the  first  time  by  an  explorer.  Hence,  the  last  such  node  can  send  explorers  only  to 
visited  nodes.  Thus  there  are  no  P-edges  leading  from  it.  □ 

Property  4.4.  A  P-node  sending  a  primary  explorer  to  its  successor  can  be  said  to  precede  it. 
Then  there  can  be  no  cycle  of  precedence. 

Argument:  If  node  a  sends  a  primary  explorer  to  b ,  then  this  will  cause  b  to  send  explorers 
from  b.  Hence  the  activation  of  a  could  not  have  been  from  one  of  these  explorers.  Thus  if 
a  precedes  b,  then  b  could  not  precede  a. 

Corollary 1: It follows immediately that an edge (a,b), if it is a P-edge, can only carry an
explorer in one direction, either from a to b, or vice versa.

Corollary 2: It also follows that a non-P edge must carry explorers in both directions. For if
a and b are not activated one by another, they must have both sent out explorers on all their
non-first edges. Clearly, the edge (a,b) is such an edge. Thus, it must carry explorer[a.b] as
well as explorer[b.a]. □

Property 4.5. Every explorer on an edge gets a corresponding echo.

Argument:  Consider  all  non-P  edges.  They  carry  explorers  to  visited  nodes,  and  immediately 
induce  an  echo.  For  those  P-edges  which  lead  to  sink  nodes,  a  corresponding  echo  is  also 
generated  at  that  node.  It  follows  that  a  node  which  only  has  non-P  edges  leading  from  it 
will  get  all  its  echos,  and  be  able  to  send  an  echo  on  its  first  edge,  or  else  it  is  a  sink  node, 
and  also  echos. 

A  P-node  has  a  primary  edge  leading  out  of  it,  and  a  non-P  node  does  not.  Consider  an 
explorer  on  a  P-edge  from  a  P-node.  It  either  leads  to  another  P-node  or  to  a  non-P  node. 
By  induction  on  the  finite  size  of  the  graph,  all  P-nodes  must  eventually  lead  to  non-P 
nodes.  In  the  previous  paragraph,  we  have  shown  that  non-P  nodes  will  echo  on  their  first 
edges.  Hence,  the  P-edges  leading  to  non-P  nodes  will  receive  echos.  By  induction,  all  P- 
edges  will  eventually  receive  an  echo.  □ 

4.3.  Performance of Algorithm 4.0

Consider the number of message passes first. Each edge (a,b) can carry at most 2
explorers, one in each direction, and 2 corresponding echos, so the total number is bounded
by 4e, where e is the number of edges in the graph. Note that for a digraph the bound is 2e,
since there are no symmetrical pairs of explorers which travel on directed edges.

In Chapter 1, we discussed the rationale for looking at communication time, and for
assuming that each edge takes approximately one unit of time to traverse, so that we can
estimate bounds for the communication cost of a decentralized algorithm. Define the S-span
of a graph as the longest of the shortest paths from S to any element in the reach set of S.
When the metric of weight used is message travel time, we use the term timed S-span.

It follows that the timed S-span of a graph represents the time it takes for explorers to
reach every node of the graph, called the forward phase of the echo algorithm. If we assume
that explorers and echos have the same speeds, then the traversal of the graph from S will
take twice the timed S-span of the graph.

For a C-graph or a digraph I, traversals starting from any node i will have the same
reach set. The largest of the i-spans is referred to as the diameter of the graph. The timed
diameter is then the largest timed i-span over all starting nodes i. We extend this notion to
include digraph IIs, even though the reach sets of different nodes may be different. The
largest i-span will be taken to be the diameter of a digraph II. From this point on, unless
otherwise specified, we will mean timed path length when we refer to path length, and timed
diameter when we refer to diameter.

In the execution of a pure traversal algorithm from an initiator S, the communication
time is less than or equal to twice the diameter of the graph. This result follows immediately
from the definition of diameter and the parallel activity of the algorithm.
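
The definitions of S-span and diameter can be illustrated with a breadth-first search. The sketch below assumes that every edge takes exactly one unit of time, so that path length is the number of edges; the function names s_span and diameter and the successor-dictionary representation are hypothetical and chosen only for this example.

```python
from collections import deque

def s_span(graph, s):
    """Longest of the shortest paths from s to any node in the reach set of s.

    graph maps each node to the list of nodes reachable over one out-edge
    (for a C-graph, simply its neighbours). Unit edge time is assumed.
    """
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:          # first time v is reached
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(graph):
    """Largest i-span over all starting nodes i."""
    return max(s_span(graph, i) for i in graph)

# A path a-b-c-d as a C-graph, and a small directed chain.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
chain = {1: [2], 2: [3], 3: []}
print(s_span(path, "a"), s_span(path, "b"), diameter(path), diameter(chain))
```

For the directed chain the reach sets differ from node to node, and the diameter is taken as the largest span, in line with the extension to digraphs described above.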

The storage required at each node for a pure traversal algorithm is O(n) bits, where n is
the number of nodes in the graph. This follows from observing that a node i has at most n
edges, each of which needs one bit to mark the arrival of its echo. To mark the primary
edge of i requires only log n bits, and to maintain the name of the node only uses log n bits.
Each message carries the basic identity of the initiator and a type, which is
(log n + 1) bits. Finally, one bit is needed to mark a node as visited. Thus, the total is
(n + 3 log n + 2) bits, which is O(n) bits.

4.4.  Traversal  Execution  Graph 

Since  we  have  not  made  any  particular  assumptions  as  to  the  exact  speed  of  explorers  or 
echos,  a  particular  execution  of  an  echo  algorithm  may  cause  different  sequences  of  arrivals 
of  explorers,  and  different  edges  to  be  primary  edges.  Each  execution  of  a  graph  G  by  a  pure 
traversal  algorithm  can  be  represented  by  an  execution  graph,  EG,  drawn  as  follows: 

Draw the node S' in EG to correspond to the node S in G, and for each explorer
which goes from S to a successor node i in G, create a new node i' in EG, and
draw a directed arc from S' to i'. Do this for each explorer coming from a node i
in G, creating a new node in EG to correspond to its successor in G. If an
explorer terminates at a visited node or a sink node in G, its corresponding
directed edge in EG terminates in a leaf node of EG. Nodes in EG are labelled
according to the names of their corresponding nodes in G.
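
The drawing procedure above can be sketched in code by recording one EG node per explorer. This is an illustrative sketch for C-graphs only: the names execution_graph and leaves, the list-based tree representation, and the delay parameter are assumptions introduced for the example. Echos are omitted because they do not affect which explorer reaches a node first.

```python
import heapq

def execution_graph(graph, s, delay=lambda a, b: 1):
    """Build EG for one execution of the pure traversal on a C-graph.

    Returns (label, parent): label[k] is the name in G of EG node k, and
    parent[k] is the EG node it hangs from (None for the root S').
    """
    label, parent = [s], [None]
    internal = {s: 0}          # G node -> EG node created at its first visit
    events = []                # (arrival time, seq, sender, receiver)
    seq = 0

    def fan_out(t, node, first_from):
        nonlocal seq
        for v in graph[node]:
            if v != first_from:            # all edges except the first edge
                heapq.heappush(events, (t + delay(node, v), seq, node, v))
                seq += 1

    fan_out(0, s, None)
    while events:
        t, _, a, b = heapq.heappop(events)
        label.append(b)                    # every explorer adds one EG node
        parent.append(internal[a])         # ... below the sender's EG node
        if b not in internal:              # primary explorer: b is now visited
            internal[b] = len(label) - 1
            fan_out(t, b, a)
        # otherwise the explorer stopped at a visited node: a leaf of EG
    return label, parent

def leaves(parent):
    """EG nodes that have no outgoing arc."""
    has_child = {p for p in parent if p is not None}
    return [k for k in range(len(parent)) if k not in has_child]

g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
label, parent = execution_graph(g, 1)
print(label, parent, [label[k] for k in leaves(parent)])
```

Different delay functions generally give different execution graphs for the same G, which is the variability the section goes on to discuss.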

The  execution  graphs  for  different  types  of  graphs  share  the  same  general  characteristics, 
but  differ  in  some  details.  We  will  introduce  their  salient  features  by  considering  the  execu¬ 
tion  graphs  of  connected  undirected  graphs  first,  and  then  seeing  what  the  differences  are  in 
the  case  of  directed  graphs.  Figure  4.3  shows  a  C-graph,  and  several  of  its  possible  execution 
graphs. In spite of the differences in the topology of these execution graphs, however, they
exhibit  some  remarkably  regular  properties. 

First  of  all,  it  is  easy  to  see  that  each  is  a  directed  tree,  in  which  the  root  is  the  initiator 
S,  and  directed  edges  represent  the  movement  of  explorers  from  a  node  which  is  the  root  of 
a  sub-tree,  to  its  successors.  The  movement  of  echos  is  up  the  tree  EG,  with  the  echo-merge 
mechanism  operating  at  each  root  of  a  sub-tree,  to  produce  a  new  echo.  A  leaf  node  in  EG 
which  has  a  corresponding  internal  node  in  EG  represents  the  stopping  of  an  explorer  at  a 
visited  node  in  G.  In  fact,  if  a  is  the  leaf  node  and  b  is  its  immediate  predecessor  in  EG, 
then not only do a and b exist as internal nodes in EG, but there also exists, by Property
4.4, the edge [a,b] in EG, with b being a leaf. A leaf in EG with no corresponding internal
node represents a sink node in G.

4.4.1.  The  Leaves  of  EG 

The number of leaves of EG is equal to the number of distinct explorers in EG, which is
also the number of distinct paths taken in the traversal of G. The graph EG has a number
of internal nodes EG.int, and a number of leaf nodes EG.leaf. The original graph G has n
nodes, one of which is S, the initiator of the algorithm. The number of edges of S is called
the degree of S, written S.d. The rest of the graph G may have a number of sink nodes,
each of which has only one edge. Call this number G.sink. Each of these corresponds to a
leaf node in EG which has no matching internal node. We see that the internal nodes in EG
are exactly the nodes of G which are not sinks, i.e.,

EG.int = n - G.sink

Figure 4.3  a C-graph and some

The number of leaves of EG can be found by the following calculation for C-graphs.

Let  S.d  be the degree of S,
     n    be the number of nodes of G,
     {R}  be the set of nodes in G which are not sinks or S, with cardinality r,
     δ    be the sum of the degrees of the nodes in {R}.

Proposition 4.1.

EG.leaf = S.d + [δ - 2r]

Argument:  Start  with  the  degree  of  S.  There  are  at  least  that  many  explorers  in  the  execu¬ 
tion  of  G.  Consider  the  remaining  nodes.  A  node  with  two  edges  has  one  primary  in-edge 
and  one  out-edge.  Thus,  an  explorer  coming  to  such  a  node  does  not  create  an  additional 
path,  but  merely  extends  an  existing  one. 

Therefore, the number of edges in excess of two at each of the remaining nodes
represents the number of additional paths created by explorers starting from S. However, if
a node is a sink in G, clearly it does not add any more paths since it only has one edge.
Thus, we exclude all sink nodes, and the initiator node, from consideration. Call this
remaining set of nodes R, with cardinality r. For each of these nodes the number of
additional paths is the number of edges at the node, i.e., the degree of the node, in excess of
two. For all nodes in {R}, then, the additional paths are the sum of the degrees of these
nodes less twice their number (2r). □
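
As a worked illustration of Proposition 4.1, the leaf count can be computed directly from the degrees of G. The helper name eg_leaf_count and the adjacency-dictionary representation are assumptions made for this sketch; the formula itself is the one stated in the proposition.

```python
def eg_leaf_count(graph, s):
    """Number of leaves of EG for a C-graph, per Proposition 4.1.

    graph maps each node to the list of its neighbours; s is the initiator.
    """
    # R: nodes that are neither the initiator nor sinks (degree one).
    R = [v for v in graph if v != s and len(graph[v]) > 1]
    degree_sum = sum(len(graph[v]) for v in R)     # sum of degrees over R
    return len(graph[s]) + (degree_sum - 2 * len(R))

# Square 1-2-4-3 with diagonal 2-3: S.d = 2, R = {2, 3, 4}, degree sum = 8.
g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
print(eg_leaf_count(g, 1))    # 2 + (8 - 2*3)

# A star with the initiator at the centre: every leaf of EG is a sink of G.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(eg_leaf_count(star, 0))
```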

4.4.2.  The  Edges  of  EG 

Each  edge  in  EG  represents  the  movement  of  an  explorer  and  therefore  the  total  number 
of  edges  in  EG  is  a  measure  of  the  total  work  done  in  one  execution  of  a  pure  traversal 
algorithm.  It  turns  out  that  this  number  is  dependent  only  on  the  original  graph  G,  and  not 
at  all  on  the  manner  of  traversal.  Furthermore,  the  number  of  edges  in  EG  can  be  computed 
from  a  C-graph  G,  as  follows: 

Let EG.e be the number of distinct edges of EG we wish to count, and EG.leaf the
number of leaves of EG, which we can compute from G as above. Given that we also
know G.e, the total number of edges in G, and G.sink, the number of sink nodes in G not
including S, then

Proposition 4.2.

EG.e = G.e + (EG.leaf - G.sink)/2

Argument:  Note  first  that  every  edge  of  G  is  in  EG,  either  as  an  edge  leading  to  an  internal 
node,  or  an  edge  leading  to  a  leaf  node  which  is  a  sink.  Thus  the  number  of  edges  in  EG  is 
at  least  G.e,  the  number  of  edges  in  G. 

Now consider those leaf nodes in EG which represent the stopping of an explorer at a
visited node. If such a path is [a,b], from a to b, then by Corollary 2 of Property 4.4, there
must be a symmetric path [b,a] which holds an explorer going the opposite way which stops
at node a. Each edge in G carrying an explorer to a visited node therefore contributes an
additional edge to EG.

The number of such additional edges is easily found. The leaf nodes of EG represent
either explorers stopping at sinks, or at visited nodes, in which case such explorers occur in
pairs. Therefore, if we know the number of sink nodes in G which contribute to the leaves
of EG, the remaining leaves of EG are those from explorers stopping at visited nodes. The
number of sinks in G can be counted by simply examining G, with the proviso that if the
initiator node S is a sink node (has only one edge), then it is not included, for it cannot
contribute to a leaf of EG, being by definition the root of EG.

The number of leaf nodes in EG from visited nodes in G is then EG.leaf - G.sink, and
the number of additional edges of EG in excess of G.e is half this number. But EG.leaf can
be found, by Proposition 4.1, from knowing some parameters of the original graph G.
Therefore the number of distinct edges in EG can also be determined from the graph G
alone. □
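
Proposition 4.2 can likewise be evaluated from G alone. The sketch below is a worked instance under the same assumptions as before (a C-graph given as an adjacency dictionary); the function name eg_edge_count is hypothetical, and the leaf count is recomputed inside it so that the block stands on its own.

```python
def eg_edge_count(graph, s):
    """Number of edges of EG for a C-graph, per Proposition 4.2."""
    g_e = sum(len(nbrs) for nbrs in graph.values()) // 2     # edges of G
    # Sinks of G, not counting the initiator even if it has only one edge.
    g_sink = sum(1 for v in graph if v != s and len(graph[v]) == 1)
    # EG.leaf from Proposition 4.1.
    R = [v for v in graph if v != s and len(graph[v]) > 1]
    eg_leaf = len(graph[s]) + sum(len(graph[v]) for v in R) - 2 * len(R)
    # Leaves that are not sinks come in symmetric pairs on non-P edges.
    return g_e + (eg_leaf - g_sink) // 2

# Square 1-2-4-3 with diagonal 2-3: G.e = 5, EG.leaf = 4, G.sink = 0.
g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
print(eg_edge_count(g, 1))    # 5 + (4 - 0)/2

# A tree has no non-P edges, so EG has exactly the edges of G.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(eg_edge_count(star, 0))
```

Since each edge of EG corresponds to one explorer, the value for the square example agrees with the 7 explorers that a traversal of that graph sends, whichever edges happen to become first edges.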

For  directed  graphs,  a  similar  execution  graph  can  be  drawn.  The  number  of  leaf  nodes 
is found by taking the out-degree of S (the number of out-edges of S), and adding the
number  of  out-edges  in  excess  of  one  at  each  of  the  remaining  nodes.  This  follows  from  the 
simple  observation  that  additional  paths  are  created  only  at  nodes  which  have  more  than  one 
out-edge. 

The  EG  for  a  directed  graph  has  the  nice  property  that  the  number  of  edges  in  EG  is 
exactly  the  number  of  edges  in  G.  This  is  so  because  each  node  can  only  send  one  explorer 
on  an  out-edge,  and  each  edge,  being  directed,  can  get  an  explorer  only  from  its  source 
node. 

4.4.3.  The Traversal Spanning Tree

Observe  that  if  we  remove  all  the  leaf  nodes  representing  the  termination  of  explorers  at 
visited nodes, and the edges directed into them, from EG, we are left with a tree in which
each  node  of  G  is  represented  only  once,  and  each  edge  is  a  first  edge  to  its  successor  node. 
This  is  exactly  a  spanning  tree  of  G.  We  call  it  a  traversal  spanning  tree,  and,  for  brevity,  a 
P-tree,  since  each  edge  is  a  P-edge.  Thus  we  see  that  the  parallel  traversal  method  guaran¬ 
tees  the  construction  of  a  spanning  tree,  in  which  every  node  is  visited  once.  The  traversal 
execution  graph  not  only  includes  a  spanning  tree,  but  also  an  edge-spanning  tree  in  which 
each  edge  is  traversed.  Note  that  in  a  P-tree,  all  intermediate  nodes  are  P-nodes  and  all  leaf 
nodes  are  non-P  nodes,  since  none  of  their  out-edges  in  EG  are  P-edges. 
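
To illustrate how the P-tree depends on message timing, the sketch below records only the first edges chosen during the forward phase. It is a simplified illustration for C-graphs: the name p_tree, the delay function, and the example delays are assumptions, and echos are left out since they do not influence which edges become first edges.

```python
import heapq

def p_tree(graph, s, delay):
    """Return the set of P-edges (first edges) of one traversal from s.

    delay(a, b) is the assumed travel time of a message from a to b.
    """
    first = {s: None}
    events = []                # (arrival time, seq, sender, receiver)
    seq = 0

    def fan_out(t, node):
        nonlocal seq
        for v in graph[node]:
            if v != first[node]:
                heapq.heappush(events, (t + delay(node, v), seq, node, v))
                seq += 1

    fan_out(0, s)
    while events:
        t, _, a, b = heapq.heappop(events)
        if b not in first:         # primary explorer: (a, b) is a P-edge
            first[b] = a
            fan_out(t, b)
    return {(first[v], v) for v in first if v != s}

g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
unit = lambda a, b: 1
slow_12 = lambda a, b: 10 if {a, b} == {1, 2} else 1   # edge (1,2) is slow

print(p_tree(g, 1, unit))
print(p_tree(g, 1, slow_12))
```

Both runs yield n - 1 = 3 first edges covering every node, but when the edge between 1 and 2 is slow the tree routes around it, which is the behaviour the comparison with a pre-determined spanning tree below relies on.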

This  is  the  main  reason  why  echo  algorithms  will  be  seen  to  be  a  basic  technique  for  dis¬ 
tributed  systems.  It  uses  a  method  of  constructing  a  spanning  tree  in  parallel,  with  commun¬ 
ication  time  just  twice  the  diameter  of  the  graph.  It  is  furthermore  a  method  in  which, 
regardless  of  the  exact  sequencing  of  the  messages,  the  total  number  of  message  passes,  a 
measure  of  overall  work,  is  constant  for  a  given  graph,  and  can  be  precomputed. 

In  a  computer  network  it  may  be  argued  that  once  a  minimum  spanning  tree  has  been 
found,  it  is  the  fastest  way  to  broadcast  a  message  to  all  nodes.  However,  because  of  the 
variability  of  communication  delays,  any  pre-determined  spanning  tree  may  not  in  fact 
represent  the  fastest  current  set  of  paths  which  reach  all  nodes.  On  the  other  hand,  a  pure 
traversal  echo  algorithm  always  takes  the  minimum  amount  of  time  to  span  the  entire 
graph,  and  thus,  in  general,  may  be  expected  to  perform  slightly  better  than  any  given 
minimum  spanning  tree.  This  does  not  presume  that  the  echos  for  the  pure  traversal  also 
take the minimum time to return. Furthermore, the echo algorithm requires 2e messages,
while a minimum spanning tree traversal only needs n messages.


4.5.  Specific  Echo  Algorithms 

The  basic  pure  traversal  algorithm  can  be  modified  to  yield  a  large  number  of  decentral¬ 
ized  parallel  graph  algorithms.  Among  these  are  sorting,  largest  finding,  distribution  of  a 
list,  finding  the  minimum  spanning  tree,  detecting  deadlock,  finding  the  biconnected  com¬ 
ponents  of  a  graph,  and  finding  the  shortest  path  tree.  Some  of  these  have  a  Starting 
Number  of  1,  i.e.,  are  single-source,  while  others  are  multi-source.  Recall  that  simultaneous 
initiation  of  the  algorithm  from  several  nodes  is  a  built-in  feature  of  multi-source  algo¬ 
rithms.  For  single-source  algorithms,  every  initiator  invokes  an  independent  execution  of  the 
algorithm.  Typically,  in  a  single-source  algorithm,  the  answer  is  known  by  the  initiator 
only,  whereas  for  multi-source,  the  answer  may  be  distributed  among  all  the  nodes.  We 
make  the  assumption  in  these  algorithms  that  the  names  of  the  nodes  have  a  total  ordering, 
and  that  we  use  the  names  of  the  nodes  for  their  ranks.  In  the  rest  of  this  chapter,  we  will 
present  some  of  the  simpler  algorithms. 

Algorithm  4.1.  Single  Source  Largest  Finder 

Any node may find out the largest numbered node in the system. This algorithm applies
to a C-graph, or the S-reach graph of any directed graph. We require each echo to carry,
in  addition  to  the  name  of  the  initiator,  a  field  large  containing  the  name  of  a  node.  The 
algorithm  is  as  follows: 

1.  The initiator S starts the forward phase of a pure traversal by sending explorers in
parallel on all its edges.

2.  A  node  getting  an  explorer  for  the  first  time  marks  the  edge  as  first,  and  sends  explor¬ 
ers  in  parallel  on  its  other  edges.  If  the  node  is  a  sink,  the  explorer  terminates,  and  the 
node  sends  an  initial  echo  on  its  first  edge. 

3.  A  node  getting  a  subsequent  explorer  sends  an  initial  echo  on  the  edge  on  which  the 
explorer  came.  The  explorer  terminates. 

4.  Let  each  initial  echo  carry  the  name  of  its  origin  in  the  field  large.  (Recall  that  an  ini¬ 
tial  echo  is  an  echo  arising  from  the  termination  of  an  explorer  at  a  node  we  call  the 
origin  of  the  initial  echo.) 

5.  Each  node  takes  the  maximum  of  the  names  carried  by  echos  coming  back,  and  its 
own  name,  as  the  value  of  large  in  its  echo.  Now  try  (6). 

6.  When  all  echos  have  arrived  at  a  node,  the  node  echos  on  its  first  edge,  with  its  echo 
carrying  the  current  value  of  large. 

7.  If  the  node  in  (6)  is  the  initiator,  then  the  maximum  thus  found  is  the  largest  node  in 
the  graph. 
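
The seven steps above can be sketched as a simulation in which each echo carries the field large. This is a minimal illustration for C-graphs with comparable node names; the function name largest_finder, the graph representation, and the delay parameter are assumptions made for the example rather than part of the algorithm as stated.

```python
import heapq

def largest_finder(graph, s, delay=lambda a, b: 1):
    """Simulate Algorithm 4.1 on a connected C-graph; returns the largest name."""
    if not graph[s]:
        return s                          # degenerate single-node graph
    first = {s: None}
    pending, large = {}, {}
    events = []                           # (time, seq, kind, from, to, value)
    seq = 0

    def send(t, kind, a, b, value=None):
        nonlocal seq
        heapq.heappush(events, (t + delay(a, b), seq, kind, a, b, value))
        seq += 1

    def visit(t, node):
        large[node] = node                # step 5: a node counts its own name
        targets = [v for v in graph[node] if v != first[node]]
        pending[node] = len(targets)
        for v in targets:                 # steps 1 and 2: explorers in parallel
            send(t, "explorer", node, v)
        if not targets:                   # step 2: sink sends an initial echo
            send(t, "echo", node, first[node], node)

    visit(0, s)
    while events:
        t, _, kind, a, b, value = heapq.heappop(events)
        if kind == "explorer":
            if b not in first:            # step 2: first explorer at b
                first[b] = a
                visit(t, b)
            else:                         # steps 3 and 4: initial echo, origin b
                send(t, "echo", b, a, b)
        else:
            large[b] = max(large[b], value)          # step 5
            pending[b] -= 1
            if pending[b] == 0:                      # step 6
                if b == s:
                    return large[s]                  # step 7
                send(t, "echo", b, first[b], large[b])
    return None

g = {3: [7, 1], 7: [3, 1, 9], 1: [3, 7], 9: [7]}
print(largest_finder(g, 3))
```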

Recall  that  an  execution  graph  is  a  representation  of  a  specific  traversal  of  G,  and  that 
the  movement  of  echos  in  G  can  be  seen  as  the  upward  movement  of  echos  from  the  leaves 
back  to  the  root  in  the  tree  EG.  Then  it  is  easy  to  see  that  what  we  are  doing  is  simply 
finding  the  largest  of  each  sub-tree  successively  as  we  go  back  up  the  tree.  This  algorithm 
can  be  seen  to  operate  in  the  same  number  of  message  passes  and  communication  time  as 
the traversal algorithm, with each node needing space of at least one bit per edge.
Remembering the first edge at a node requires at most log n bits, since even in a fully
connected graph a node has at most n edges. The field large also needs at most
log n bits. The bit marker for each echo is therefore the dominant component of storage
used at a node.

Algorithm  4.2.  Single-source  Sort 

If  a  node  needs  to  find  out  the  names  of  the  other  nodes  in  the  network,  it  can  get  them 
easily in sorted form. Again, the algorithm applies to C-graphs and S-reach graphs of
directed  graphs.  Each  echo  needs  to  be  able  to  carry  the  names  of  all  the  nodes.  Every  initial 
echo  carries  the  name  of  its  origin.  Explorers  only  need  basic  identification,  i.e.,  the  name  of 
the  initiator,  and  a  type.  Each  node  holds  a  current  list  initially  containing  only  its  own 
name. 

1.  Let the initiator S start the forward phase of a pure traversal by sending explorers in
parallel on its edges.

2.  An explorer coming to a node for the first time marks the edge as first, and sends
explorers in parallel on the other edges of the node. If the node is a sink, the explorer
terminates, and an initial echo is sent instead, on the first edge of the node.

3.  A subsequent explorer at a node terminates, and an initial echo is sent on the edge on
which the explorer arrived.

4.  Each initial echo carries the name of its origin.

5.  As an echo arrives at a node, the list of names carried by that echo is merged into the
current list at the node, deleting duplicates. Try (6).

6.  After all echos have arrived at a node, it sends off its echo, on its first edge, containing
its current list.

7.  If the node in (6) is the initiator, then the current list is the sorted list of all the nodes
of the graph.
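
A sketch of these steps follows, with each echo carrying a sorted list of names that is merged at the receiving node. The names single_source_sort and merge_unique, the graph representation, and the delay parameter are hypothetical; the sketch covers C-graphs only and implements the basic version in which every initial echo carries the name of its origin.

```python
import heapq

def merge_unique(xs, ys):
    """Merge two sorted lists into one sorted list, deleting duplicates."""
    out, i, j = [], 0, 0
    while i < len(xs) or j < len(ys):
        if j == len(ys) or (i < len(xs) and xs[i] <= ys[j]):
            x = xs[i]; i += 1
        else:
            x = ys[j]; j += 1
        if not out or out[-1] != x:
            out.append(x)
    return out

def single_source_sort(graph, s, delay=lambda a, b: 1):
    """Simulate Algorithm 4.2 on a connected C-graph; returns the sorted names."""
    if not graph[s]:
        return [s]
    first = {s: None}
    pending, current = {}, {}
    events = []                           # (time, seq, kind, from, to, names)
    seq = 0

    def send(t, kind, a, b, names=None):
        nonlocal seq
        heapq.heappush(events, (t + delay(a, b), seq, kind, a, b, names))
        seq += 1

    def visit(t, node):
        current[node] = [node]            # current list starts with own name
        targets = [v for v in graph[node] if v != first[node]]
        pending[node] = len(targets)
        for v in targets:
            send(t, "explorer", node, v)
        if not targets:                   # sink: initial echo on first edge
            send(t, "echo", node, first[node], [node])

    visit(0, s)
    while events:
        t, _, kind, a, b, names = heapq.heappop(events)
        if kind == "explorer":
            if b not in first:
                first[b] = a
                visit(t, b)
            else:                         # initial echo carries its origin
                send(t, "echo", b, a, [b])
        else:
            current[b] = merge_unique(current[b], names)     # step 5
            pending[b] -= 1
            if pending[b] == 0:                              # step 6
                if b == s:
                    return current[s]                        # step 7
                send(t, "echo", b, first[b], list(current[b]))
    return None

g = {3: [7, 1], 7: [3, 1, 9], 1: [3, 7], 9: [7]}
print(single_source_sort(g, 3))
```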

In  considering  the  execution  graph,  clearly  the  algorithm  is  collecting  a  merged  list  of 
the  nodes  in  each  sub-tree,  progressively  towards  the  root.  We  are  not  proposing  this  as  an 
improved  sorting  algorithm,  but  rather  pointing  out  that  a  simple  echo  algorithm  can  per¬ 
form  a  basic  function  effectively. 

In terms of communication time and message passes, this is the same as a basic
traversal. In terms of storage, each echo may have to carry n names, and hence each node needs
at least 2n log n bits, to accommodate its own current list, and the list carried by an arriving
echo.

We  can  improve  the  algorithm  by  a  simple  modification.  Let  only  P-nodes  include  their 
names  in  the  sub-lists  being  constructed.  Since  P-nodes  form  a  spanning  tree,  each  node  is 
included  once  and  only  once  in  any  list.  There  is  no  redundancy,  and  the  number  of  opera¬ 
tions  in  the  merges  is  the  same  as  in  a  conventional  merge-sort.  This  change  is  accom¬ 
plished  by  having  echos  which  arise  from  explorers  at  visited  nodes  carry  an  empty  list. 

Algorithm  4.3.  Multi-source  Biggest  Finder 

In  Chapter  3,  we  saw  the  use  of  extrema  finding  as  a  decentralized  mutual  exclusion 
mechanism in a circular configuration of processes. We now present a similar algorithm
which uses the extinction principle and is applicable to arbitrary configurations. Consider a
distributed system of processors modelled by a C-graph. As in the algorithm for election
seen  in  Chapter  3,  we  use  a  method  of  assigning  priorities  which  produces  unique  values, 
and  gives  higher  priority  to  earlier  requests,  so  that  starvation  will  not  occur.  In  the  descrip¬ 
tion  which  follows,  the  values  associated  with  nodes  are  the  priority  values  thus  assigned.  In 
the  general  case,  however,  the  unique  name  of  a  node  could  also  have  been  used  for  its 
value.  This  would  not  be  desirable  in  using  this  algorithm  for  the  multiple  copy  update  prob¬ 
lem,  for  then  starvation  may  occur. 

Let  each  node  maintain  a  field  called  superior.  Initially,  each  node  is  its  own  superior. 
The  idea  is  that  a  node  will  keep  track  of  the  largest  explorer  ever  encountered  in  the  field 
superior.  Now  every  node  tries  to  perform  a  traversal,  but  the  explorers  and  echos  which 
encounter  larger  superiors  will  be  extinguished.  Only  the  largest  node  can  complete  a  traver¬ 
sal  successfully. 

Assume  every  node  starts  a  basic  traversal.  We  can  adopt  the  technique  of  Algorithm 
3.2.2,  the  Start-up  Variant,  to  handle  staggered  starting,  using  asleep  and  awake  states.  By 
adding a third shutoff state, we can use this algorithm to handle electing one of k
contenders, as used in the algorithm in Section 3.2.5.1, for the multiple-copy update problem.

1.  If  the  explorer  coming  to  a  node  is  less  than  the  superior  of  the  node,  do  nothing. 

2.  If  an  explorer  is  greater  than  the  node’s  superior,  then  record  the  initiator  of  the 
explorer  as  the  new  superior  of  that  node,  and  mark  the  edge  as  the  new  first  edge  of 
that  node.  Mark  all  edges  as  not  yet  in  receipt  of  any  echos.  Send  out  explorers  in 
parallel  on  out-edges.  If  the  node  is  a  sink,  echo  along  the  first  edge  instead. 

3.  If  an  explorer  is  equal  to  the  node’s  superior,  then  it  must  be  a  subsequent  explorer. 
Send  an  echo  back  on  that  edge. 

4.  If  an  echo  is  less  than  a  node’s  superior,  do  nothing,  since  the  initiator  of  the  echo 
can’t  be  the  largest.  Thus  its  traversal  cannot  be  completed. 

5.  If  an  echo  is  equal  to  a  node’s  superior,  then  mark  the  edge  as  having  received  its 
echo.  If  all  edges  have  received  echos,  generate  an  echo  to  send  on  the  first  edge  of  the 
node. 

6.  If  the  node  in  (5)  is  the  initiator  of  the  echo,  then  the  algorithm  is  finished,  and  the 
node  is  elected. 
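
The six rules can be sketched as a simulation in which every message carries the value of its initiator. This is an illustrative sketch only: it assumes a connected C-graph with at least two nodes, uses the node names themselves as the values, starts every node at time 0 (no asleep, awake or shutoff states), and the names elect and delay are hypothetical.

```python
import heapq

def elect(graph, delay=lambda a, b: 1):
    """Simulate Algorithm 4.3; returns (elected node, messages sent)."""
    superior = {v: v for v in graph}      # each node is its own superior
    first = {v: None for v in graph}
    pending = {}
    events = []                           # (time, seq, kind, family, from, to)
    seq = 0
    sent = 0

    def send(t, kind, family, a, b):
        nonlocal seq, sent
        heapq.heappush(events, (t + delay(a, b), seq, kind, family, a, b))
        seq += 1
        sent += 1

    def fan_out(t, node, family):
        # Mark all out-edges as awaiting echos and send explorers on them.
        targets = [v for v in graph[node] if v != first[node]]
        pending[node] = len(targets)
        for v in targets:
            send(t, "explorer", family, node, v)
        return targets

    for v in graph:                       # every node starts a traversal
        fan_out(0, v, v)
    while events:
        t, _, kind, f, a, b = heapq.heappop(events)
        if f < superior[b]:
            continue                      # rules 1 and 4: message extinguished
        if kind == "explorer":
            if f > superior[b]:           # rule 2: adopt the larger superior
                superior[b], first[b] = f, a
                if not fan_out(t, b, f):  # sink: echo along the first edge
                    send(t, "echo", f, b, a)
            else:                         # rule 3: subsequent explorer
                send(t, "echo", f, b, a)
        else:
            assert f == superior[b], "echo larger than superior"
            pending[b] -= 1               # rule 5: mark the edge
            if pending[b] == 0:
                if b == f:
                    return b, sent        # rule 6: the initiator is elected
                send(t, "echo", f, b, first[b])
    return None, sent

g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
winner, messages = elect(g)
print(winner, messages)
```

In this sketch the explorers and echos of smaller initiators are simply dropped when they meet a larger superior, so only the traversal of the largest node runs to completion.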

Note  that  if  an  echo  is  larger  than  a  node’s  superior,  then  something  is  drastically 
wrong,  for  this  echo  cannot  have  originated  from  an  explorer  from  this  node. 

If all traversals were to succeed, the number of message passes would be bounded by
4ne, since there would be n traversals, each of which needs at most 4e message passes.
However, clearly a large number of messages get extinguished. Given staggered starting, at
best only the messages from the biggest are ever emitted, the rest being suppressed by its
messages. Thus, we would need 2e messages. In the worst case, no matter what the graph,
the message passes are bounded above by 4ne. The average case is difficult to analyse,
particularly because we would like the expected number of messages not of any particular
graph configuration, but over some average graph.

We have seen in Chapter 3 that in the case of a circular configuration, O(n log n) message
passes are needed. In Appendix II, we show the derivation of the expected number of
messages for a star-graph and for a fully connected graph. These are both essentially n log n.
In a fully connected graph, the number of edges is n(n - 1)/2. Thus, the expected number
of messages is only of order sqrt(e) log e, where e is the number of edges. We conjecture that
for the average case of connected graphs, the number of message passes is of
order n log n. This is strengthened by the observation that a circular configuration
represents almost the sparsest connected graph, while a fully connected graph is the densest.
It may be the case that a fully connected graph has too many edges which are carrying
information. Thus, many explorers are eliminated, so that in this case only about sqrt(e)
edges are needed.
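
The step from n log n to an expression in the number of edges can be written out as follows. This is a reconstruction of the arithmetic only, assuming the standard count of n(n - 1)/2 edges in a fully connected graph on n nodes:

```latex
e = \frac{n(n-1)}{2} \;\Longrightarrow\; n \approx \sqrt{2e},
\qquad
n \log n \;\approx\; \sqrt{2e}\,\log\sqrt{2e}
        \;=\; \tfrac{1}{2}\sqrt{2e}\,\log(2e)
        \;=\; O\!\left(\sqrt{e}\,\log e\right).
```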

The space needed is dominated by the number of bits required to mark echos, O(n),
since everything else uses O(log n) bits. However, if the graph is sparse, so that
the average number of edges at a node is of O(log n), then the space at each node will be of
O(log n) bits.

Communication  time  for  executing  n  simultaneous  executions  of  a  traversal  algorithm 
can  be  considered  to  be  the  same  as  executing  one  algorithm,  if  the  processors  are  very  fast 
compared  to  communications.  Thus,  as  messages  arrive  they  are  handled  with  a  negligible 
loss  of  time.  The  messages  are  sequential  on  any  particular  link,  and  as  long  as  transmission 
rate  is  larger  than  message  generation  rate,  we  will  not  be  backed  up.  We  have  shown  in 
Chapter 1 how we might include queueing costs as a component of the unit of
communication time.

It  should  be  quite  clear  that  the  superior  mechanism  will  always  prevent  every  traversal 
except  one  from  succeeding.  Thus,  only  one  node  will  be  elected.  Note  that  some  process 
must  reset  the  superior  fields  of  all  the  nodes  back  to  be  the  names  of  the  nodes  themselves 
before  another  execution  of  the  algorithm  can  be  successful  if  the  membership  of  the  graph 
changes  between  executions. 

If we use this algorithm for the multiple-copy update problem, then following the
election of a node, it can send its update message to all other nodes, using a parallel traversal.
Each  node  receiving  an  update  message  for  the  first  time  will  update  the  resource,  and  send 
the  update  in  parallel  to  its  neighbours.  As  a  node  echos  on  its  first  edge,  it  can  reset  its 
state  to  allow  another  election  to  take  place. 

Algorithm 4.4. Multi-source Sort

This  algorithm  has  been  motivated  by  the  multiple-copy  update  problem  presented  in 
Chapter  3.  It  produces  a  distributed  ordering  of  those  nodes  which  wish  to  update,  and  are 
included  in  this  cycle  of  updates.  There  are  two  mechanisms,  one  to  include  candidates  for  a 
particular  cycle,  and  the  other  to  produce  the  distributed  ordering. 

The  inclusion  mechanism  works  as  follows.  Each  node  has  a  status  which  is  either 
asleep,  awake,  or  shutoff.  Initially,  all  nodes  are  asleep.  Some  node  that  spontaneously 
wishes  to  start  turns  itself  awake.  An  explorer  coming  to  an  awake  node  considers  that 
node  included.  An  explorer  coming  to  an  asleep  node  turns  it  to  shutoff,  and  clearly  an 
explorer at a shutoff node considers it excluded. Note that a shutoff node does not
participate in the node ordering, but must act to receive echos and send explorers, since it may be
intermediate  in  the  path  between  two  awake  nodes. 

To  do  the  ordering  of  nodes,  each  echo  keeps  two  fields:  larger  and  smaller.  Initially, 
larger  contains  some  largest  implementation  number  TOP,  and  smaller  some  smallest 
implementation number BOT. Each included node performs a basic traversal, and among
the  included  nodes  it  encounters,  finds  the  smallest  node  larger  than  itself  and  the  largest 
node  smaller  than  itself.  When  the  algorithm  is  finished,  the  largest  of  included  nodes  will 
know  that  fact  because  its  larger  remains  as  TOP,  while  the  smallest  of  the  nodes  has  its 
smaller  value  unchanged  from  BOT. 

1.  Let  an  asleep  node  wishing  to  participate  turn  itself  awake,  and  send  its  explorers  on 
all  out-edges. 

2.  An  explorer  coming  to  an  asleep  node  turns  it  to  shutoff. 

Note: From this point, consider only the activities for a particular family of messages,
originating from a particular initiator. Thus, an explorer coming to a node on a P-edge means
an  explorer  of  that  family  visiting  a  node  for  the  first  time. 

3. An explorer on a P-edge to a node marks the node as having been visited by that
family, records that edge as the first edge for that family, then sends out more explorers.
If the node is a sink, then if it is an awake node, (5) is done, else if it is a shutoff
node, (4) is done.

4. An explorer coming to a visited shutoff node creates an initial echo, carrying the
initiator name, with larger = TOP, and smaller = BOT. The node sends the echo back.

5.  An  explorer  coming  to  a  visited  awake  node  creates  the  same  echo,  but  if  the  node  is 
smaller  than  the  initiator,  places  the  node  name  in  smaller  and  if  larger,  places  it  in 
larger.  The  node  echos. 

6.  As  a  node  receives  echos  for  a  family,  it  constructs  an  echo  which  will  contain  the 
largest  of  the  [smaller]  and  the  smallest  of  the  [larger]  fields  in  the  echos.  After  all  the 
echos  have  arrived,  an  awake  node  tries  to  place  its  own  name  in  either  the  larger  or 
the  smaller  field  of  the  constructed  echo.  Then  it  echos  along  its  first  edge.  A  shutoff 
node  simply  echos  on  its  first  edge. 

7. If the node in (6) is the initiator of the echos in question, then the algorithm
terminates for that initiator. The initiator that is largest among awake nodes identifies
itself  by  finding  that  its  larger  field  still  contains  TOP. 

This  algorithm  contains  k  parallel  executions  of  a  modified  basic  traversal.  In  thinking 
of  echos  coming  back  up  the  sub-trees  of  the  execution  graph,  all  nodes  are  involved  for 
echoing  purposes,  but  only  awake  nodes  compete  for  being  either  the  largest  of  the  nodes 
smaller  than  the  initiator,  or  the  smallest  of  the  larger  nodes.  The  result  of  this  algorithm  is 
that  a  doubly  linked  list  of  the  sorted  nodes  is  found  and  stored  in  distributed  form,  among 
the  larger  and  smaller  fields  of  the  k  involved  nodes. 
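
The way echos accumulate the larger and smaller fields can be sketched as below. This is a simplified illustration rather than the message-level algorithm: a breadth-first spanning tree stands in for the tree of first edges, node names are assumed to be numbers so that infinities can play the roles of TOP and BOT, and the function names are hypothetical.

```python
from collections import deque

BOT, TOP = float("-inf"), float("inf")   # assumed stand-ins for BOT and TOP

def order_for(graph, awake, initiator):
    """Return the (smaller, larger) pair the initiator ends up holding."""
    # Explorer phase: a breadth-first tree stands in for the first edges.
    first = {initiator: None}
    order = [initiator]
    queue = deque([initiator])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in first:
                first[w] = v
                order.append(w)
                queue.append(w)
    # Echo phase: children are processed before their parents.
    echo = {v: [BOT, TOP] for v in order}
    for v in reversed(order):
        if v != initiator and v in awake:
            # An awake node tries to place its own name in the echo.
            if v < initiator:
                echo[v][0] = max(echo[v][0], v)
            else:
                echo[v][1] = min(echo[v][1], v)
        parent = first[v]
        if parent is not None:
            # Keep the largest of the smaller fields and the
            # smallest of the larger fields.
            echo[parent][0] = max(echo[parent][0], echo[v][0])
            echo[parent][1] = min(echo[parent][1], echo[v][1])
    return tuple(echo[initiator])

def multi_source_sort(graph, awake):
    """One traversal per awake node; shutoff nodes only relay echos."""
    return {v: order_for(graph, awake, v) for v in awake}
```

On a ring of six nodes with awake nodes 1, 4 and 5, the result links 1, 4 and 5 into a doubly linked list, with node 5 still holding TOP in its larger field and node 1 still holding BOT in its smaller field.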

Using the assumptions about communications for Algorithm 4.3, the elapsed time for
this algorithm is just O(D), since there are k traversals, but each traversal occurs in parallel.
Furthermore, the number of message passes in total is bounded by 4ke, since a single
traversal needs at most 4e message passes.

Finally, each node needs at most n bits to mark echo arrivals for each family, in the
worst case of a fully connected graph. In addition, for each family, each node needs to
maintain an echo to hold intermediate results. This holds the names of three nodes and
needs only 3 log n bits. Therefore, storage is bounded by n + 3 log n bits at a node, for each
family. For the entire algorithm, then, we need nk + 3k log n bits. Since k is at most n, the
storage needs are of O(n^2).
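
The storage bound can be written out as a short worked calculation, using only the quantities named in the paragraph above:

```latex
\underbrace{n}_{\text{echo marks}} + \underbrace{3\log n}_{\text{three node names}}
\ \text{bits per family at a node},
\qquad
k\,(n + 3\log n) = nk + 3k\log n \;\le\; n^{2} + 3n\log n = O(n^{2})
\quad (k \le n).
```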

If we were to use this technique for the multiple copy update problem, the largest
contender would start by sending an update using parallel traversal to all nodes. We would
adopt  Ellis’  technique  [ELLI  77b]  of  using  two  types  of  update  messages,  an  update  and  an 
update  final.  When  all  echos  for  an  update  traversal  have  returned  to  the  initiator  of  the 
update,  it  transfers  control  to  the  node  described  in  its  smaller  field.  If  it  is  the  last  node  to 
update,  then  its  smaller  contains  the  implementation  number  BOT,  and  the  node  sends  out 
update  final  messages.  All  other  nodes  which  initiate  an  update  use  update  messages.  Nodes 
which  echo  following  the  execution  of  update  final  messages  will  reset  their  states  to  asleep , 
so  that  another  set  of  updates  can  be  performed. 

Algorithm  4.5.  Distribution  of  a  List 

Consider a C-graph with n nodes. Let some node be given a list with n distinct numbers,
and let its task be the distribution of these numbers to the nodes of the graph such that each
node gets a different number. Obviously, the node given the list can be the initiator of the
distribution. However, it does not know how many to pass to each edge. One traversal of
the graph can establish the distribution information throughout the graph, however, and in
the second phase of the algorithm, the task at each node is simply to send out items on each
edge according to the information stored in that node.

Let  the  initiator  node  be  S.  Consider  any  traversal  of  G  from  S,  and  its  execution  graph. 
This contains a spanning tree of G, as described in Section 4.4.3. Each edge in the spanning
tree  is  a  first  edge  to  its  successor  node,  and  there  are  exactly  n  nodes  in  this  tree.  We  use 
the  echos  from  this  tree  to  compute  at  each  node  how  many  numbers  to  send  down  which 
edges. 

1.  Let  the  initiator  node  send  explorers  in  parallel  on  its  edges. 

2.  Consider  the  P-tree  induced  by  the  traversal  of  G.  Every  edge  of  this  tree  is  a  P-edge, 
and  a  first  edge  to  its  successor  node.  Let  each  echo  which  comes  back  on  a  P-edge 
bear  a  special  marking,  and  call  them  P-echos. 

3.  The  leaf  nodes  of  the  P-tree  are  non-P  nodes.  They  have  no  out-edges  which  are  P- 
edges.  Let  each  P-echo  from  a  non-P  node  carry  a  number  field  with  value  1 . 

4.  At  each  P-node,  record  against  each  out-going  P-edge  the  number  carried  back  by  its 
corresponding  P-echo. 

5.  Furthermore,  as  each  P-echo  comes  back  to  a  P-node,  accumulate  the  number  values 
at  the  node.  When  all  the  echos  have  come  back,  send  out  a  P-echo  along  the  first 
edge  of  the  node,  with  number  field  equal  to  the  cumulated  sum  plus  one  (for  itself). 

6.  The  initiator  node  will  then  get  back  from  each  of  its  P-edges  a  number  which  tells  it 
how  many  items  to  send  down  that  edge.  It  keeps  one  item  for  itself. 

7.  Each  successive  node  will  get  a  list  of  items.  It  has  recorded  the  number  to  send  on 
each  of  its  P-edges,  after  keeping  one  for  itself. 

8.  The  algorithm  terminates  when  all  items  have  been  distributed  at  P-nodes  which  are 
the leaves of the P-tree. Note that we can impose an additional echo scheme to allow
the  initiator  to  know  when  the  distribution  has  been  completed. 

This algorithm requires execution time bounded by two traversals of the graph, one to
create a spanning tree and one to distribute the list. Thus the number of messages is
bounded by 4e. The storage at each node is the identity of its P-edges, the number for each
P-edge, and one bit for each echo. This is of order O(n log n) bits.
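
The two phases can be sketched as follows. This is a minimal sketch under assumptions: a breadth-first tree stands in for the P-tree produced by the parallel traversal, the echo counts are computed by a direct tree fold rather than by messages, and the function name `distribute` is hypothetical.

```python
from collections import deque

def distribute(graph, source, items):
    """graph: dict node -> list of neighbours (connected, undirected).
    Returns a dict giving each node exactly one item from `items`."""
    # Phase 1a: explorers build a spanning tree (assumed here: breadth-first).
    first = {source: None}
    order = [source]
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in first:
                first[w] = v
                order.append(w)
                queue.append(w)
    children = {v: [] for v in order}
    for v in order[1:]:
        children[first[v]].append(v)
    # Phase 1b: each P-echo carries the subtree size (leaves send 1, inner
    # nodes send the accumulated sum plus one for themselves).
    count = {v: 1 for v in order}
    for v in reversed(order):
        if first[v] is not None:
            count[first[v]] += count[v]
    if len(items) != count[source]:
        raise ValueError("need exactly one item per node")
    # Phase 2: each node keeps one item and forwards count[c] items
    # down the P-edge leading to child c.
    assigned = {}
    work = deque([(source, list(items))])
    while work:
        v, batch = work.popleft()
        assigned[v] = batch[0]
        pos = 1
        for c in children[v]:
            work.append((c, batch[pos:pos + count[c]]))
            pos += count[c]
    return assigned
```

The recorded counts are what let each node split its incoming batch without knowing anything about the shape of the graph beyond its own edges.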

4.6.  Comparisons  with  other  Multiple  Copy  Update  Algorithms 

In  Chapter  3  we  have  described  several  election  methods,  based  on  extinction,  which 
implement  decentralized  mutual  exclusion  in  distributed  systems  configured  as  circles.  In 
Chapter  4,  we  have  extended  these  methods  to  systems  configured  as  general  graphs.  There 
are basically two algorithms: electing one contender at a time, and sequencing k contenders
in  one  pass.  We  have  seen  that  these  algorithms  have  application  to  the  multiple  copy 
update  problem.  In  this  section,  we  will  compare  some  features  of  these  algorithms  to  other 
methods  which  have  been  proposed  in  the  literature  for  solving  this  problem.  There  have 
been,  in  fact,  rather  a  large  number  of  solutions.  We  will  take  six  typical  approaches,  those 
of  Thomas  [THOM  75],  Rosenkrantz  [ROSE  77],  Ellis  [ELLI  77b],  Gelenbe  and  Sevcik 
[GELE  78],  LeLann  [LELA  78]  and  Montgomery  [MONT  78].  We  will  comment  both  on 
the  differences  in  qualitative  nature  of  the  algorithms,  as  well  as  on  some  performance 
measures. 

First, we will briefly summarize these methods. Thomas’ algorithm uses a majority
consensus of nodes voting on update requests. A node voting OK will record the update as
pending.  If  the  update  is  rejected,  the  pending  update  must  be  rolled  back.  A  node  which 
votes  OK  and  produces  a  majority  vote  on  an  update  will  perform  the  local  update,  then 
send  DO-updates  to  all  other  nodes.  This  method  is  intended  to  be  robust  in  the  face  of 
failures. 

The Rosenkrantz algorithm is essentially a locking scheme in which an update process
moves from node to node making temporary updates. If it finds a conflict with another
process, one of them must roll back all its temporary updates. If a process succeeds in updating
all  the  nodes,  it  takes  another  pass  to  make  these  updates  permanent. 

Ellis’  method  presumes  a  circle  of  n  nodes,  arranged  in  order  of  priority.  Messages 
travel  in  the  direction  of  increasing  priority.  Each  node  is  allowed  to  update  when  the 
request  to  update  message  it  issues  returns  to  it.  This  occurs  in  delayed  fashion,  for  lower 
priority  updates  are  held  temporarily  by  higher  priority  nodes.  When  a  node  completes  its 
updates,  it  allows  the  request  message  it  is  holding  to  continue.  Clearly,  the  highest  priority 
node  gets  to  update  first,  then  the  next,  and  so  on. 

Gelenbe  and  Sevcik  have  studied  several  different  policies  for  applying  updates  as  they 
are  received  at  a  node.  Each  update  is  time-stamped  at  its  node  of  origin,  given  a  sequence 
number  at  the  node,  and  broadcast.  They  show  that  the  policy  of  immediate  updating  will 
require  roll  backs.  A  node  can  also  wait  until  it  has  received,  from  all  other  nodes,  messages 
which are later than the time on its update, and is also sure that, from their sequence
numbers, no messages are missing. If some nodes do not send update messages, however, it may
take a long time before this condition is realized. Thus, to improve elapsed time, status
messages containing the last update originating at each node can be broadcast periodically.

LeLann’s  technique  uses  a  circulating  control  token  in  a  virtual  ring,  which  issues  a  ticket 
at  each  node  which  wishes  to  update  a  common  resource.  The  node  can  then  broadcast  the 
update, and at each node, since the ticket number of the last update is known, the updates
can  be  applied  in  the  correct  sequence,  as  they  arrive. 

Montgomery has proposed a method of atomic broadcasting to all nodes, so that
concurrent broadcasts from several nodes are guaranteed to arrive at the destinations in the
same  order.  This  is  achieved  by  predefining  a  hierarchy  of  all  the  nodes  in  the  system,  and 
requiring  the  highest  node  which  is  the  parent  of  the  destination  nodes  of  a  broadcast  to 
actually  do  the  message  forwarding. 

4.6.1.  Qualitative  Comparisons 

Decentralized election algorithms are simple, dynamic, parallel and local. Consider
simplicity first. All multiple copy update algorithms have the goal of producing a consistent
sequencing of updates at all copies. This is achieved in our methods by globally sequencing
the updates, so that all nodes first complete one update, and then the next, and so on. This
sequencing is done in a simple way, by election. When the sequencing is based on
incremental locking, as in Thomas and Rosenkrantz, strategies such as abort, roll back and
confirmation become necessary. The Gelenbe algorithm requires each node to keep track of
a set of pending previous updates, as well as the time at which each node last updated.
The LeLann algorithm is simpler, but has the cost of supporting a virtual ring, and of
keeping out-of-sequence updates around until they can be applied.

Montgomery’s algorithm requires a predefined structure on the processors to be
established in addition to the original network configuration. In a sense, this is also true of the
virtual  ring  of  LeLann,  and  Thomas’  daisy  chain.  Although  Ellis  does  not  say  so  explicitly, 
his  algorithm  appears  to  assume  the  context  of  a  circle  of  processors.  This  requirement  for 
added  structure  is,  we  have  said,  a  weak  form  of  central  control.  The  circulating  control 
token  of  LeLann,  although  distributed  in  one  sense,  is  also  in  many  ways  a  global  resource. 
There may be a trade-off between the extra messages and time required for strictly
decentralized algorithms, and the problems in algorithms with centralization of choosing a priori
the  best  organization  of  processors,  or  in  the  case  of  Montgomery,  the  problems  associated 
with  the  failure  of  nodes  high  in  the  hierarchy. 

Thomas’ algorithm is completely sequential, in that each node votes on an update in
turn.  The  Rosenkrantz  algorithm  is  not  necessarily  sequential,  for  an  echo  algorithm  could 
send  the  update  process  from  node  to  node  in  parallel.  However,  neither  Rosenkrantz  nor 
Gelenbe  consider  these  explicitly,  except  that  in  Gelenbe,  broadcasting  is  assumed.  Ellis’  is 
parallel  in  the  first  phase,  of  allowing  the  largest  to  go  while  stopping  the  others,  then 
sequential  from  there.  In  Gelenbe  and  LeLann,  once  all  updates  have  arrived  at  a  node,  all 
of  them  can  be  applied.  Thus,  potentially,  a  great  deal  of  parallelism  is  possible.  In 
Montgomery,  parallel  activity  is  implicit  in  the  leaves  of  the  hierarchy,  as  messages  go 
toward  the  root.  Once  the  updates  have  arrived  at  their  destinations,  they  are  certain  to  be 
in  the  right  order,  and  can  be  applied  immediately. 

Decentralized  algorithms  are  local  in  the  sense  that  no  need  exists  for  knowing  either  the 
number  of  nodes  in  the  graph,  or  the  configuration  of  the  graph.  Thomas  and  Gelenbe  must 
know  the  size  of  the  graph.  Although  Ellis  does  not  require  this  knowledge,  it  is  highly 
dependent on structure. Both the tree structure of Montgomery and the virtual ring
structures require a complete knowledge of the original graph initially in creating them.
Subsequently, each node needs only to know of its adjacent neighbours.


4.6.2.  Performance  Measures 

We  will  now  attempt  to  characterize  very  roughly  some  performance  measures.  These 
are  in  terms  of  number  of  messages  and  elapsed  time.  For  simplicity  we  will  use  a  circle  of  n 
processors  as  our  model,  partly  because  some  of  the  algorithms  assume  such  a  configuration. 
We  will  consider  a  single  update,  and  also  k  concurrent  conflicting  updates.  Note  that  in  the 
latter  case,  the  throughput  is  just  the  inverse  of  the  average  time  per  update.  It  must  be 
recognized  that  these  algorithms  are  different  qualitatively,  both  in  terms  of  mechanisms 
used,  and  in  other  services  offered,  even  though  they  share  the  common  goal  of  coordinating 
concurrent  activity  in  distributed  databases.  For  example,  Thomas  deals  with  robustness  in 
the  face  of  various  kinds  of  failures,  which  Gelenbe  and  Ellis  do  not.  Keeping  these  relative 
differences  in  mind,  what  follows  can  only  be  crude  measures  of  relative  performances.  They 
are  not  meant  to  be  a  definitive  ranking  of  the  algorithms,  which  cannot  be  done  without 
taking  their  qualitative  characteristics  into  account.  For  ease  of  comparison,  we  use  the 
assumptions  of  our  model,  that  traversal  times  on  edges  are  approximately  the  same. 

Consider a single update first. In the decentralized election methods, single election uses
n messages and n time for one election, and then n messages and n time for the update.
Thus, the total number of messages is 2n, and elapsed time is also 2n. For the k-sequencing
algorithm, a single update produces exactly the same behaviour, 2n for both messages and
time. For Thomas, only n/2 messages are needed for a majority vote, and then n messages
are needed for the update. Thus, time and number of messages are both 3n/2. The Rosenkrantz
algorithm will need n messages for the update, another n for the confirmation, and thus total
time and number of messages are both 2n. Ellis will need n to get to update and n for the update.
Again, this is a 2n message and time algorithm.

Gelenbe’s  is  more  difficult.  Assume  messages  must  go  around  the  circle.  Then  it  takes  n 
time  for  the  update  to  get  to  all  nodes,  using  n  messages.  Before  a  node  can  update, 
though,  it  must  know  that  there  are  no  earlier  updates.  At  best,  just  after  a  node  forwards 
an update, it sends a status message to all other nodes. Thus, the total time is n, but another
n^2 messages are needed. Strictly speaking, it is not true that every update costs n^2
messages. Nevertheless, for a single update to go in best time, this is the number that must have
occurred  before  the  update  can  be  completed.  We  will  see  that  the  more  updates  there  are, 
the  less  this  overhead  costs  per  update. 

The LeLann algorithm would take n/2 time, on the average, for the control token to get
to the node wishing to update, and then n time for the update to get everywhere. The total
time can be reduced to n if we use the more complex method of predicting usage and
putting a supply of tickets at every node for future use. However, we will restrict ourselves to
the simpler algorithm. Clearly, the update itself will need only n messages. Of course, we
should also include the n/2 or more edge movements of the control token, for a total of 3n/2
message passes. The Montgomery algorithm must send the update to the top of the tree,
since all nodes are involved. It is difficult to analyse this algorithm, for we have to include
the cost of imposing a tree on a circle! Assuming a tree with one root, and two linear lists as
descendants, then on the average, an update goes n/4 to the root, and then takes n/2 time in
parallel to get to all nodes. Thus, messages needed are 5n/4, while time is 3n/4. However,
these are very crude estimates.

These results are reproduced in the following table. We can see that the use of
centralized control produces slightly better performance than the others. The Gelenbe algorithm is
fast, but needs lots of messages. For most of the algorithms considered in this case, the
limiting factor is the time and number of messages required to do the updates themselves.

                     Single Update

                     messages      time

single election      2n            2n
k-sequencing         2n            2n
Thomas               3n/2          3n/2
Rosenkrantz          2n            2n
Ellis                2n            2n
Gelenbe              (n^2)         n
LeLann               3n/2          3n/2
Montgomery           5n/4          3n/4

Now consider k contenders, first in the decentralized elections. From Chapter 3, the
number of message passes using k separate elections is nk log k for the elections, and nk for
the updates. Therefore each update takes n(log k + 1) messages. Each election takes n time,
as does each update. Therefore, the total elapsed time is 2nk, and the mean time to update
is 2n. If we used one k-sequencing, the number of message passes is, from Chapter 3, nk for
the sequencing, nk/2 for transferring control between contenders, and nk for the updates.
Thus, each update requires 5n/2 messages. The time needed is n for sequencing, nk/2 for
the transfer of control, and nk for updating. Thus, the mean time for each update is
approximately 3n/2.
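
The per-update figures follow by dividing the totals above by k. Written out, with the only approximation being that the n/k term is dropped for large k:

```latex
\begin{aligned}
\text{$k$ separate elections:}\quad
  & \frac{nk\log k + nk}{k} = n(\log k + 1)\ \text{messages},
  & \frac{2nk}{k} &= 2n\ \text{time};\\[4pt]
\text{one $k$-sequencing:}\quad
  & \frac{nk + nk/2 + nk}{k} = \frac{5n}{2}\ \text{messages},
  & \frac{n + nk/2 + nk}{k} &= \frac{3n}{2} + \frac{n}{k} \approx \frac{3n}{2}\ \text{time}.
\end{aligned}
```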

Consider  the  Thomas  algorithm  in  which  k  conflicting  updates  happen  concurrently. 
Assume the first one gets to go in n/2 message passes, while the rest are being rejected in
n/2 message passes each. Thus, for the first update to proceed, a total of kn/2 message passes
were used. The second would use (k-1)n/2, the next (k-2)n/2, etc. Thus, for all of them,
nk^2/2 messages would be used just to get the updates to go, without counting roll backs,
etc. Each update also takes n messages. Therefore, total messages is about nk^2/2, and each
update needs, on the average, nk/2 messages. The time required to allow a majority vote is
n/2, and n for the updates. Therefore, each update needs 3n/2. The difficult aspect of this
algorithm  is  that  although  it  takes  n  /2  votes  to  accept  an  update,  it  takes  just  as  many  to 
reject  it.  If  the  algorithm  proceeds  in  parallel,  with  k  conflicting  updates,  clearly  only  one 
can  go,  while  the  others  are  either  rejected,  or  close  to  being  rejected.  In  an  asynchronous 
system,  it  becomes  difficult  to  estimate  this  behaviour,  except  very  approximately. 
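
For reference, the voting series described above can be summed exactly. This is a side calculation, not taken from the thesis; it gives a somewhat smaller constant than the rounder figures used in the text, but the same order of growth in n and k:

```latex
\sum_{i=1}^{k} \frac{i\,n}{2} \;=\; \frac{n}{2}\cdot\frac{k(k+1)}{2} \;=\; \frac{nk(k+1)}{4},
\qquad
\frac{1}{k}\left(\frac{nk(k+1)}{4} + nk\right) \;=\; \frac{n(k+1)}{4} + n \;=\; \Theta(nk).
```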

For the Rosenkrantz algorithm, each update and confirmation will need 2n messages,
but for all of them, there are also the rollbacks of partial updates for those which did not
succeed. Assume for each update that the others partially update n/2 nodes, without
knowing the sequence in which the k updates could have been rejected in turn in various parts of
the circle. Then, the messages to account for partial update and rollback amount to n per
successful update. Therefore, the total number of messages per update is 3n. The elapsed
time is 2n for each update and confirmation.

In  considering  the  Ellis  algorithm,  note  that  each  update  must  send  its  request  all  the 
way  around  before  it  can  update.  Given  the  assumptions  of  a  strict  ordering  of  nodes  by 

These results are summarized in the table below.


                     k-Contenders

                     messages per update    mean time per update

single election      n log k                2n
k-sequencing         5n/2                   3n/2
Thomas               kn/2                   3n/2
Rosenkrantz          3n                     2n
Ellis                2n                     2n
Ellis variant        n + (n log k)/k        n + 2n/k
Gelenbe              (n^2/k)                n/k
LeLann               n                      2n/k
Montgomery           5n/4                   n/k
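
To compare the rows for particular values of n and k, the table entries can be evaluated directly. The sketch below is only a convenience for reading the table: the formulas are transcribed from the table as read here, the logarithm is assumed to be base 2, and the function name is hypothetical.

```python
import math

def k_contender_costs(n, k):
    """Return {method: (messages per update, mean time per update)}
    using the crude estimates tabulated above (log assumed base 2)."""
    lg = math.log2(k)
    return {
        "single election": (n * lg, 2 * n),
        "k-sequencing": (5 * n / 2, 3 * n / 2),
        "Thomas": (k * n / 2, 3 * n / 2),
        "Rosenkrantz": (3 * n, 2 * n),
        "Ellis": (2 * n, 2 * n),
        "Ellis variant": (n + n * lg / k, n + 2 * n / k),
        "Gelenbe": (n * n / k, n / k),
        "LeLann": (n, 2 * n / k),
        "Montgomery": (5 * n / 4, n / k),
    }

if __name__ == "__main__":
    for method, (messages, time) in k_contender_costs(16, 4).items():
        print(f"{method:16s} messages={messages:6.1f} time={time:5.1f}")
```

As the text stresses, these are crude measures and not a definitive ranking.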

4.6.3  Discussion 

The  most  striking  observation  in  these  comparisons  is  that  those  algorithms  which 
update  in  strict  global  sequence  are  inferior  to  algorithms  which  can  update  simultaneously, 
at  each  node,  in  the  same  sequence.  To  be  able  to  do  this,  the  Montgomery  and  LeLann 
algorithms  use  global  control  mechanisms,  for  which  other  kinds  of  costs,  not  included  in 
these  simple  observations,  are  exacted.  For  example,  in  LeLann,  there  is  the  continual 
overhead  of  the  circulating  control  token,  as  well  as  the  imposition  of  a  virtual  ring.  In 
Montgomery,  there  is  the  need  to  have  a  predetermined  hierarchy  of  processors.  For 
Gelenbe,  there  is  either  the  cost  in  time,  waiting  for  later  messages  to  come  from  all  other 
nodes before a node applies an update, or there is the overhead cost of supplying status
messages, which are n^2 each time. Finally, all the simultaneous update algorithms must provide
the  storage  space  for  n  messages,  which  is  not  necessary  in  any  of  the  update  in  sequence 
algorithms.  These  other  costs  have  not  been  included  in  our  simple  comparisons,  and  more 
detailed  and  comprehensive  treatment  will  be  interesting  further  research. 

Decentralized  election  algorithms  are  average  in  the  performance  measures  we  have 
used, with the single election method clearly inferior to k-sequencing. However, in terms of
simplicity  and  flexibility,  and  a  minimal  number  of  assumptions  made  about  the  structure 
and  extent  of  the  graph,  they  appear  to  be  very  attractive.  Their  design  has  also  been  used  to 
study  some  difficult  questions  such  as  the  nature  of  parallel  traversal  and  of  decentralized 
control.  We  propose  these  algorithms  as  one  of  many  alternative  methods  of  solving  the 
multiple  copy  update  problem,  each  of  which  has  some  features  which  would  make  it  more 
attractive  in  certain  situations  than  in  others. 

4.7.  Reliability  Considerations 

In this section, we will discuss the effect of failures of nodes and links on echo
algorithms. Since they are based on parallel traversal, we will first examine failures with reference
to the pure traversal algorithm. Following that, we will consider problems specific to
particular algorithms. In our system, message loss occurs only with node failures. However, we
can have failure of links as well as failure of nodes. In contrast to the circular case, in which
we did not assume that a node failure meant a loss of communication between its
neighbours, we will, for the general graph, take the failure of a node to include the failure of
communications between its immediate neighbours via itself.

We  consider  only  permanent  failures,  for  temporary  ones,  if  they  can  be  distinguished  as 
such,  are  best  treated  by  simply  waiting  for  recovery.  If  a  node,  or  the  link  to  it,  fails  before 
an  explorer  arrives,  it  is  as  if  the  graph  did  not  contain  that  node,  or  its  edges.  Noting  that 
the  sender  node  has  edge  information  which  is  no  longer  accurate,  we  could  update  the  edge 
information  at  the  sender,  and  consider  that  we  have  a  new  graph.  The  echo  algorithm  will 
function  for  this  new  graph.  Of  course,  a  node  could  be  unable  to  send  an  explorer  to  a  node 
which  has  failed  but  had  already  received  a  previous  explorer.  Even  then  the  update  and 
continuation  would  be  okay,  for  the  real  predecessor  of  the  failed  node  would  detect  its  loss 
as  described  below. 

If a node fails after getting an explorer, either before it has sent out its own explorers, or
after, but in any case before it has echoed, then its predecessor never gets its expected echo,
nor can the failed node receive echos from its successors. If a node fails before it has
echoed, its predecessor must time-out or use a reassurance mechanism to detect this failure.
Time-out at the predecessor node would cause an “OK?” query to be sent to the node, while
a reassurance mechanism would require a node periodically to send an “OK” message to its
predecessor, on its first edge. A node which detects failure in this situation, or a node which
cannot send an echo to a failed node, must ABORT, either by informing the originator of
the algorithm, or by taking the responsibility upon itself. The algorithm can then be
restarted by the originator.

If a node fails after it has echoed, the initiator node will not know of this unless it
subsequently tries to communicate with the node, such as in the sequenced update algorithm.
This problem, however, is not one which is unique to decentralized echo algorithms, and admits
of no easy solutions.

The above discussion applies to all single-source echo algorithms which send explorers
and then wait for echos. For the multi-source biggest finder, all nodes except one
legitimately do not complete a pure traversal. To distinguish between this situation, and
waiting on a failed node, a time-out or reassurance mechanism must be used. When node failure
is detected in this way, the algorithm must be aborted. This raises the complex question of
whether every node detecting failure sends an ABORT to all nodes, and when nodes can
restart the algorithm. Furthermore, nodes which detect failed nodes because they cannot send
explorers or echos to them should also cause an ABORT. It appears that following the
detection of a failure, then, all nodes should go through a reconfiguration to get adjacency
information updated, before the original task is retried. This also raises the interesting
problem of the protocol which a failed node, or a new node, might use to gain entry to
the system. These will be important topics for future research.

One  possible  response  to  a  node  detecting  a  failed  node  in  the  multiple-source  election 
algorithm  is  as  follows:  let  us  consider  that  each  node  has  three  states,  work,  abort  and  set. 
Initially,  every  node  is  in  the  work  state.  Let  there  be  three  types  of  messages:  WORK, 
which  are  normal  to  the  execution  of  algorithms,  ABORT,  and  SET.  A  node  in  work  state 
enters abort state either on detecting a failed node, or on getting an ABORT message.

1. In work state, enter abort state if a node failure is detected or an ABORT message is
received.

2.  On  entering  abort  state,  send  an  ABORT  message  to  all  immediate  neighbours.  A  new 
adjacency  table  is  constructed  at  the  node  by  including  only  those  neighbours  which 
received  an  ABORT  message  successfully.  Enter  the  set  state.  In  abort  state,  ignore 
(after  acknowledgment)  all  WORK  messages  and  all  other  ABORT  messages.  All 
SET  messages  are  also  ignored,  except  those  preceded  immediately  by  an  ABORT 
message on an edge. Such edges are marked as having received a SET message. If a
WORK message follows the SET message on that edge, queue up the WORK message for
that edge.

3. In set state, first send a SET message to all its neighbours. Then wait for all its edges
to receive a SET message. When this has happened, the node can enter work state,
and either restart the algorithm locally, or deal with queued WORK messages. While
in set state, all ABORT messages are ignored. WORK messages are also ignored,
unless they arrive at an edge on which a SET message has already arrived. Note that
such a SET message must also have been preceded by an ABORT message. The
WORK messages from re-started nodes are queued. If a neighbour of a node in set
state fails, the node must time-out on waiting for its SET message on that edge, and
send a query. If it detects node failure, it remains in set state, updates its adjacency
table, and continues to wait for a full complement of SET messages from its known
neighbours.
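
The three rules above can be sketched as a small message-driven state machine. This is only an illustration under stated assumptions, not the thesis's implementation: the class `Node`, its method names, and the driver `run` are hypothetical; links are assumed to be FIFO; sending is treated as instantaneous, so the abort state collapses into a single step; and the time-out handling of a neighbour that fails while a node is in set state is left out.

```python
from collections import deque


class Node:
    """One processor in the three-state ABORT/SET scheme (simplified sketch)."""

    def __init__(self, name, neighbours):
        self.name = name
        self.neighbours = set(neighbours)
        self.state = "work"
        self.abort_seen = set()  # edges on which an ABORT arrived this cycle
        self.set_seen = set()    # edges marked as having received a SET
        self.queued = []         # WORK messages held back until restart
        self.handled = []        # WORK messages dealt with in work state

    def _restart(self):
        # Leave set state: forget this abort cycle and deal with queued WORK.
        self.state = "work"
        self.abort_seen.clear()
        self.set_seen.clear()
        self.handled.extend(self.queued)
        self.queued.clear()

    def _abort_then_set(self, reachable):
        # Rule 2: ABORT to all neighbours; the new adjacency table keeps only
        # those that accepted it. Rule 3: then SET to all known neighbours.
        self.neighbours = set(reachable)
        self.state = "set"
        out = [(self.name, n, "ABORT") for n in sorted(self.neighbours)]
        out += [(self.name, n, "SET") for n in sorted(self.neighbours)]
        if not self.neighbours:  # isolated node: nothing to wait for
            self._restart()
        return out

    def detect_failure(self, failed):
        # Rule 1: a node in work state that detects a failed neighbour aborts.
        return self._abort_then_set(self.neighbours - set(failed))

    def receive(self, src, kind, payload=None):
        if self.state == "work":
            if kind == "ABORT":  # Rule 1: an ABORT message also triggers abort
                self.abort_seen = {src}
                return self._abort_then_set(self.neighbours)
            if kind == "WORK":
                self.handled.append((src, payload))
            return []
        # set state
        if kind == "ABORT":
            self.abort_seen.add(src)  # otherwise ignored
        elif kind == "SET" and src in self.abort_seen:
            self.set_seen.add(src)
            if self.set_seen >= self.neighbours:
                self._restart()
        elif kind == "WORK" and src in self.set_seen:
            self.queued.append((src, payload))  # from a re-started neighbour
        return []


def run(nodes, initial):
    """Deliver messages; one global FIFO preserves the order on every edge."""
    wire = deque(initial)
    while wire:
        src, dst, kind = wire.popleft()
        if dst in nodes:
            wire.extend(nodes[dst].receive(src, kind))


# Chain a - b - c, where a also had a neighbour x which has failed.
nodes = {"a": Node("a", {"b", "x"}), "b": Node("b", {"a", "c"}), "c": Node("c", {"b"})}
run(nodes, nodes["a"].detect_failure({"x"}))
states = {name: node.state for name, node in nodes.items()}
```

In this run every surviving node passes through set state and returns to work state, and node a's adjacency table no longer lists x.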

This  three-state  decentralized  ABORT  and  restart  system  requires  the  set  state  as  a 
buffer  between  a  previous  cycle  of  aborts,  and  the  current  execution.  Thus,  a  node  which  has 
restarted will be guaranteed, because of the buffer set state, not to receive an ABORT
message from a previous ABORT cycle. This is so because a node can restart only after it has
received  SET  messages  from  all  neighbours,  and  this  can  happen  only  after  it  receives 
ABORT  messages  from  all  of  them.  Thus,  after  a  restart,  any  ABORT  message  received 
must  correctly  refer  to  a  current  execution.  This  method  also  has  the  nice  property  that 
even  if  several  nodes  were  to  fail  such  that  the  graph  becomes  split  into  disjoint  sub-graphs, 
all  of  them  will  properly  ABORT,  and  re-enter  the  work  state. 

For the multiple-source sequencing algorithm, the failure of nodes presents a problem
similar to that of the multiple-source election algorithm. The technique described above can
be  used  to  restart  the  sequencing  algorithm  if  it  fails  in  progress.  To  deal  with  the  problem 
of  node  failures  during  a  distributed  update  itself,  some  mechanism  like  Lampson’s  atomic 
distributed  update  [LAMPS  76]  must  be  invoked. 

4.8.  Conclusion

Some of the terminology in this chapter has been adapted from Dijkstra [DIJK 79a]. In
particular, the P-edge and P-tree properties of a traversal were clarified by this
communication. In the chapters which follow, we present more complex algorithms which are
based on the simple notion of a parallel traversal of a graph.

Chapter 5


Two  Algorithms  for  Undirected  Graphs 


5.1. Introduction 

In  this  chapter,  we  present  echo  algorithms  for  two  well-known  problems  on  undirected 
connected  graphs.  The  first  is  to  identify  the  biconnected  components  of  a  graph,  and  the 
second is to find the minimum spanning tree of a graph. The biconnected components of an
undirected graph are those sub-graphs which share a common cycle. Within a biconnected
component there are at least two paths to every node, but between biconnected components
there is only a single path. In a computer network, it may be important to know the
sub-graphs which depend on a single data path for connection to the rest of the graph.

The  minimum  spanning  tree  of  a  graph  in  which  each  edge  has  a  cost  function  is  that 
spanning  sub-tree  of  the  graph  in  which  every  node  is  present  once,  and  the  sum  of  the  costs 
of  the  edges  in  the  spanning  tree  is  the  minimum  over  all  such  trees.  This  corresponds  to  a 
minimum  cost  way  to  broadcast  information  in  a  computer  network,  one  which  also  uses  a 
minimum  number  of  edges,  since  there  are  no  cycles  in  a  spanning  tree. 

5.2.  Biconnected  Components  of  a  Graph 

For  a  connected  undirected  graph  G  its  biconnected  components  [AHO  74]  are  those 
subsets  of  nodes  and  edges  which  share  a  common  cycle.  There  may  be  edges  in  G  which 
belong  to  no  cycles,  and  these  are  considered  to  be  a  biconnected  component  of  2  members. 
All  sink  nodes,  therefore,  are  in  biconnected  components  of  2  members. 

Equivalently, biconnected components contain no internal articulation points, where “a
vertex a is said to be an articulation point of G if there exist vertices v and w such that v, w
and a are distinct, and every path between v and w contains the vertex a” [AHO 74].
However, the biconnected components themselves must be connected through articulation points
of the graph G, else they would form a single biconnected component. Thus, although a
biconnected component does not contain any internal articulation points, one or more of its
nodes may be articulation points of the whole graph G. Figure 5.1 shows a C-graph and its
biconnected components.

The algorithm which we use is single-source, and is based on echos describing their
sub-graphs as they move up the EG tree. These descriptions are simplified by replacing
biconnected components by equivalent single nodes, as they are detected at articulation points
of the graph G. The algorithm finishes when the initiator node receives all its echos, and does
a final echo-merge. At termination, each articulation point of the graph G knows the
biconnected components of which it is a member. If necessary, the initiator node can gather
this information by sending appropriate requests which trigger responses only at articulation
nodes.

5.2.1.  Termination  and  Intermediate  Sets 

This method depends on the echos returning two things. The first is the condition of
termination of its explorer, whether at a sink or at a visited node. The second is that an echo
will carry path information to a node concerning its history. An intermediate node v receiving
an echo from w will, if w is a sink, identify itself as an articulation point for the sink, and
{v,w} as a biconnected component.

Figure 5.1. A C-graph and its biconnected components. The biconnected components are
{a,b,c}, {c,d}, and {d,e,f}. The articulation points are c and d.

If  a  node  v  receives  one  or  more  echos  with  termination  status  visit,  then  it  looks  for  a 
cycle  among  its  echos.  All  nodes  which  are  involved  in  a  cycle,  among  all  the  nodes  from  its 
echos,  belong  to  the  same  biconnected  component. 

We can define this more precisely. An echo will consist of basic identity, status, and two
sets, INT and TER. The set TER contains the terminal nodes of paths for the sub-tree
which the echo describes. The set INT contains the internal nodes in paths of the sub-tree
described by a particular echo. These will be referred to collectively as the pair of the echo,
(INT,TER).

Proposition 5.1. If an echo has a status of visit, then there is another echo that also has a
status of visit, which shares a common ancestor node in the execution graph. Moreover, at
that ancestor node, there will be two echos whose termination nodes are in the set of
internal nodes of the other.

Argument: We know for a C-graph, by Property 4.2.4, that explorers stop in pairs on an
edge if the termination condition is visit. By definition, all explorers for a single-source
algorithm originate from the initiator node, and trivially, they have a common ancestor.

If two explorers stop on the same edge mutually, then if the edge is (v,w), and the
explorer stopping at w took some path p from the initiator, then the explorer stopping at v
took some path q. The path p has as its last two nodes vw, and q has as its last two nodes
wv. Thus, the termination of path p is an internal node of path q and vice versa.

Two  explorers  stopping  mutually  must  have  taken  different  paths.  We  know  they  must 
have  at  least  one  common  ancestor,  the  initiator  node.  However,  if  they  have  more  than  one 
common  node  on  their  paths,  there  must  be  a  last  such  common  node  before  their  paths 
diverge.  This  last  node  will  be  the  root  of  a  sub-tree  in  EG,  and  the  echos  which  it  receives 
will  describe  the  divergent  sub-paths  and  their  terminations.  □ 

Now we can describe how echos are merged at a node, by what we will call the Echo-merge
Rules:

1. An initial echo whose origin is a sink node w is created with INT = {φ} and TER =
{w}, and status sink. Then if v receives an echo from w with status sink, {v,w} is a
biconnected component, and v marks the edge (v,w) inactive.

2. Now consider the set of echos with status visit which come to v. Each echo carries the
pair (INT,TER). We successively perform set intersection of each TER with INT
from the other echos. If all these intersections are empty, then we produce a new pair
(INT,TER) from the union of all the INTs and the union of all TERs. A new echo is
then formed, with the new pair and a status of visit, and sent along the first edge of v.
It represents all intermediate and terminal nodes seen in the sub-tree of v, and
furthermore, describes the property that no cycles were found, to date.

3. If any TER ∩ INT ≠ φ, then form a new pair (INT,TER) from the two echos by
doing unions on TER and INT. Replace the two pairs by this new pair, and continue
(3) until the condition in (2) is found: no cycles exist among the different pairs of
(INT,TER). Each union is called cycle detection. Rule (3) is called maximal cycle
detection.

4. Now take each such pair (INT,TER) from (3), and if any element of TER is a
member of its corresponding INT set, remove it from TER. At the end, if TER is
empty, then we have found a biconnected component, whose members are {v} ∪ INT.
We now mark inactive all edges whose echos have gone to make up this pair, save {v}
∪ INT as a biconnected component at v, and remove this pair from the pairs of
(INT,TER) being considered. The process of eliminating members of TER is called
bicon identification.

5. After (4) has been applied to all pairs, either there are no pairs left, in which case v is
considered a sink, and an echo with status sink, INT of {φ} and TER of {v} is sent
along the first edge of v; or there are pairs left, in which case a new echo is formed, as
described in (2). The combined process of maximal cycle detection, and then bicon
identification, is called bicon composition.
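
Rules 2 to 5 can be sketched as one function applied at a node v to the pairs of its visit echos. This is a hedged illustration rather than the thesis's own code: the name `bicon_composition` and the tuple layout are hypothetical. Two details are not spelled out in the rules and are assumed here: the sample inputs suppose that each forwarding node has already added its own name to INT, and v itself is treated as matched if it appears in TER, since a path that terminates at v closes a cycle through v.

```python
def bicon_composition(v, visit_pairs):
    """Apply Echo-merge Rules 2 to 5 at node v.

    visit_pairs is a list of (INT, TER) pairs taken from echos with status
    visit. Returns (components, echo), where components is a list of node
    sets found at v and echo is the (status, INT, TER) to send on v's first
    edge.
    """
    pairs = [(set(i), set(t)) for i, t in visit_pairs]

    # Rule 3, maximal cycle detection: merge any two pairs in which a TER
    # of one meets the INT of the other, until no such two pairs remain.
    merged = True
    while merged:
        merged = False
        for a in range(len(pairs)):
            for b in range(a + 1, len(pairs)):
                int_a, ter_a = pairs[a]
                int_b, ter_b = pairs[b]
                if ter_a & int_b or ter_b & int_a:
                    pairs[a] = (int_a | int_b, ter_a | ter_b)
                    del pairs[b]
                    merged = True
                    break
            if merged:
                break

    # Rule 4, bicon identification: drop terminals already seen as internal
    # nodes (and, by the assumption above, v itself). An empty TER is a
    # zero pair and yields the biconnected component {v} ∪ INT.
    components, remaining = [], []
    for int_set, ter_set in pairs:
        ter_left = ter_set - int_set - {v}
        if ter_left:
            remaining.append((int_set, ter_left))
        else:
            components.append({v} | int_set)

    # Rule 5: with no pairs left, v echoes as a sink; otherwise the
    # remaining pairs are united into one visit echo as in Rule 2.
    if not remaining:
        return components, ("sink", set(), {v})
    out_int = set().union(*(i for i, _ in remaining))
    out_ter = set().union(*(t for _, t in remaining))
    return components, ("visit", out_int, out_ter)


# Triangle v, x, y: the echo via x ends at y, and the echo via y ends at x.
triangle, triangle_echo = bicon_composition("v", [({"x"}, {"y"}), ({"y"}, {"x"})])

# A single echo whose terminal z is unmatched: nothing is identified yet.
open_path, open_echo = bicon_composition("v", [({"x"}, {"z"})])
```

In the first call the two pairs merge into a zero pair, so v records {v, x, y} and echoes as a sink. In the second the unmatched terminal z is passed up in a visit echo.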

Proposition 5.2. A biconnected component of more than two nodes is found at a node v if
and only if at least two echos return whose bicon composition yields the pair (INT,TER)
with TER being empty, called the zero pair.

Argument:  Biconnected  components  of  two  nodes  are  found  if  one  of  the  nodes  is  a  sink. 
This  only  requires  one  echo  to  return  to  v  from  sink  w.  Furthermore,  if  w  is  a  sink,  then  by 
Rule  1,  v  will  identify  it  as  such. 

Now  consider  a  biconnected  component  M  of  at  least  three  nodes,  one  of  which  is  v, 
with  v  being  an  articulation  point  of  G.  Then  all  explorers  going  into  M  have  come  through 
v,  and  v  is  the  common  ancestor  of  all  the  echos  in  M.  Whatever  paths  are  taken  from  v 
into M, they must end in nodes which are in M. Therefore in the echos which come back to
v,  the  set  of  terminal  nodes  must  all  find  matching  internal  nodes. 

Consider  a  node  at  which  bicon  composition  yields  a  zero  pair.  We  show  that  the  nodes 
in  INT  must  be  a  biconnected  component.  Maximal  cycle  detection  assures  us  that  terminal 
nodes  of  the  paths  taken  from  v  end  in  intermediate  nodes  of  other  paths,  and  therefore  that 
these  nodes  are  linked.  Furthermore,  there  is  no  path  which  terminates  at  a  node  outside  of 
the  nodes  in  INT. 

The only way in which the nodes and edges of INT do not form a biconnected
component is if there is some path from one of the nodes in INT which leads to a node outside
of INT, thus making INT part of a larger biconnected component. This cannot happen
without this edge terminating at some node x outside of INT. If so, then bicon identification
would have produced TER containing x, not empty. Thus, there is no such edge leading out
of the nodes of INT.

Note that for v to be the articulation point of a biconnected component in the graph G,
there must be at least two echos returning to v. Otherwise, it would not be the root of
divergent paths leading into the biconnected component, and some other node, either further up
or down the tree, would identify the biconnected component. □

5.2.2.  Algorithm  5.0.  Biconnected  Component  Detection 

The algorithm for finding the biconnected components of a C-graph has been largely
described above by the rules for bicon composition. Given an initiator node S, let it execute
the forward phase of a pure traversal, by sending explorers in parallel on the non-first edges
of nodes as they are first visited.

1. Let the initiator send explorers in parallel on its edges. The first explorer at each node
causes more explorers to be sent in parallel from that node. An explorer coming to a
visited node v sends an echo back with the pair (INT,TER) of ({φ},{v}), and status of
visit. If the node is a sink, the same (INT,TER) pair will be sent, but the status of the
echo will be sink.

2. If a node v receives an echo with status sink having (INT,TER) of ({φ},{w}), then {v,w}
is a biconnected component, and the edge on which the echo arrived is marked
inactive. Now try (3).

3. If all the echos for a node v have arrived, and all edges have been marked inactive,
then create a new echo with status sink, and (INT,TER) of ({φ},{v}), and send the echo
on the first edge of the node.

4.  Otherwise  do  bicon  composition  according  to  Echo-merge  Rules  2  to  4,  identify  all 
biconnected  components,  mark  appropriate  edges  as  inactive.  If  all  edges  are  now 
inactive,  then  do  (3),  else  send  the  echo  built  by  Echo-merge  Rule  2,  along  the  first 
edge  of  the  node. 

5.  If  the  node  in  (3)  and  (4)  is  the  initiator,  then  the  algorithm  is  done.  A  check  for 
correctness  here  is  that  all  edges  should  be  inactive,  and  there  should  be  no  terminal 
nodes  still  unaccounted. 

5.2.3.  Behaviour

We must now consider the behaviour of the algorithm. Given that there is a single
initiator node, the algorithm is very similar to the pure traversal algorithm, in that explorers
make one forward sweep, and echos one sweep back towards the initiator. Thus, elapsed
time, considering communications mainly, is approximately 2D, where D is the diameter of
the graph.

The number of messages is, as with a pure traversal algorithm, between 2e and 4e. The
major component of effort in this algorithm comes at each echo-merge, and in the amount
of storage required at each node.

Each echo carries the sets INT and TER. At the worst, the set INT can contain n
nodes, and the same can be said of the set TER. The fact that at any time, the sum of the
nodes held in TER and INT can be no more than 2n helps in considering overall work, but
not in the amount of space that must be allocated for worst-case consideration. If each of the
sets INT and TER must contain n nodes, then n log n bits are needed for each set. For a node
to hold n echos, then, at least 2n² log n bits are needed for all echos to describe the pairs
(INT,TER). Of course, if we knew the value of n a priori, then a bit map would suffice for
storing the sets, which brings the requirement down to n².

Now consider the number of operations at each bicon composition. Maximal cycle
detection involves the intersection of each TER against a set INT from a different echo.
There can be at most n echos at each node. If each TER is intersected against n − 1 INT
sets, then there are n² intersections. At best, each set operation can be considered as one
operation. Thus, there are at least n² operations at each node, for cycle detection.

Bicon detection, given the set of disjoint pairs found by maximal cycle detection, consists
of removing elements of TER which are in its corresponding INT set. There are at most
n/2 new pairs, and each such removal can be done in at most n operations. The operations
at each node are dominated by maximal cycle detection, which is of order n².

These bounds by no means reflect an average situation. For one thing, if there are only
sinks, then maximal cycle detection is never needed. In the case of a fully connected graph,
each path has, on the average, length two. The expected number of set operations at each
non-initiator node, which receives n − 2 echos, is just one, since each echo has an INT of {φ},
and hence does not have to participate in any set intersection. At the initiator node, n − 1
echos arrive, but each intersection of an INT with a TER from a different echo will find a
matching element. Thus, only n − 2 set operations are needed. In comparison, if all INT and
TER are disjoint, then (n − 1)² intersections would have been needed. We leave the study of
this aspect to future research.

5.3.  Minimum  Spanning  Tree 

To our distributed system as described in Chapter 1, we add the attribute that each edge
has a positive weight associated with it. Let the cost of going from v to w be the sum of the
weights of the edges making up the path from v to w. The minimum spanning tree is a tree
containing all the nodes of a graph G exactly once, in which the sum of the edge weights is the
minimum over all such spanning trees.

In Chapter 2, we have described Spira’s decentralized method for finding a minimum
spanning tree [SPIR 77]. Essentially, this is done by a recursive technique. A Level 1
fragment consists of a core pair of nodes which are each other’s nearest neighbours, plus all
the nodes which are the nearest neighbours of nodes in the fragment. Then successively,
fragments are combined to form higher level fragments. This is done by each fragment finding
its cheapest out-going edge to connect to another fragment, making the connection,
eliminating all internal edges, and then finding the cheapest out-going edge of the larger
fragment. This is a technique which was also proposed by Rosenstiehl [ROSE 72]. This method
requires seven different message types, and is very complex.

We present a simpler version, which is based on Kruskal’s [KRUS 56] sequential
algorithm. The algorithm is based on having a sorted list of all edges, with their associated
costs, and a set VS containing initially subsets each holding one of the nodes of G. Starting
from the cheapest edge, each edge is considered in order, and any which would connect two
subsets of VS is accepted, and placed into T, the set of accepted edges. The algorithm finishes
when VS has been reduced to a single connected set.
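
The sequential algorithm just described can be sketched as follows. The function name `kruskal` and the edge layout `(cost, u, v)` are illustrative choices. The collection VS of disjoint subsets is represented here by a union-find `parent` map, which is one common way to test whether an edge would connect two subsets and is not something the text prescribes.

```python
def kruskal(nodes, edges):
    """Return the list T of accepted edges for a connected weighted graph.

    edges is an iterable of (cost, u, v) tuples.
    """
    # VS: initially every node is in a subset of its own.
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative of n's subset.
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # shorten the path as we go
            n = parent[n]
        return n

    accepted = []
    for cost, u, v in sorted(edges):       # cheapest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:               # edge connects two subsets of VS
            parent[root_u] = root_v
            accepted.append((cost, u, v))
            if len(accepted) == len(parent) - 1:
                break                      # VS is now a single set
    return accepted


tree = kruskal(
    ["a", "b", "c", "d"],
    [(1, "a", "b"), (2, "b", "c"), (3, "a", "c"), (4, "c", "d")],
)
total_cost = sum(cost for cost, _, _ in tree)
```

Here the edge (a,c) of cost 3 is rejected because a and c are already in the same subset when it is considered.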

5.3.1.  Decentralized  Kruskal’s  Algorithm

This is similar in concept to Kruskal’s algorithm. However, the algorithm proceeds
locally at each node. Instead of accepting edges into a final set T, each node only finds those
edges which would be candidates at the next level. An echo simply carries a set of edges
with associated weights, and echo-merge produces a new set of such edges. The initiator
node will produce the final set of edges which will be the minimum spanning tree.

1. Let the initiator node execute a traversal algorithm. However, let each echo carry
some information.

2.  Let  the  explorer  stopping  at  node  v  from  node  w  create  an  echo  with  set  E  containing 
(v,w)  and  its  cost. 

3. Consider any node a. From its echos, it gets the sets E_i. Its first task is to find the
discrete nodes represented by all these sets, as the local algorithm stops when all nodes
have been connected. Thus, it creates a set VS, containing discrete subsets each of a
single node.

4.  Next  it  performs  Kruskal’s  algorithm  as  described  above,  placing  the  accepted  edges 
into  a  temporary  set  T  for  the  node. 

5. If this node is not the initiator node, then it forms its echo. If the echo is to be sent to
node b, it creates the set E from the union of T and {(a,b)}. This echo is then sent to
b.

6.  If  this  is  the  initiator  node,  then  the  set  T  contains  the  edges  in  the  minimum  spanning 
tree  of  G. 
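
Steps 3 to 6 amount to a local echo-merge, which might be sketched as below. The function `local_merge` and its parameters are hypothetical names. Edges are stored as `(cost, u, v)` with the endpoints put in sorted order, an added convention so that the same edge reported from two sub-trees is recognised as one.

```python
def local_merge(echo_sets, first_edge=None):
    """Merge the edge sets E_i received at a node (steps 3 to 6).

    echo_sets is a list of sets of (cost, u, v). first_edge is the
    (cost, a, b) edge to the predecessor, or None at the initiator.
    Returns the set to place in the outgoing echo (or the final tree).
    """
    def normalise(edge):
        cost, u, v = edge
        return (cost, min(u, v), max(u, v))

    candidates = {normalise(e) for edges in echo_sets for e in edges}

    # Step 3: VS holds one subset for each discrete node seen in the echos.
    parent = {n: n for _, u, v in candidates for n in (u, v)}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    # Step 4: Kruskal's algorithm on the local candidates.
    accepted = set()
    for cost, u, v in sorted(candidates):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            accepted.add((cost, u, v))

    # Step 5: a non-initiator adds the edge on which its echo will travel.
    if first_edge is not None:
        accepted.add(normalise(first_edge))
    return accepted


# Triangle with initiator a; costs: (a,b) = 1, (b,c) = 2, (a,c) = 5.
# The explorers of b and c stop mutually on edge (b,c).
at_b = local_merge([{(2, "b", "c")}], first_edge=(1, "a", "b"))
at_c = local_merge([{(2, "c", "b")}], first_edge=(5, "a", "c"))
mst = local_merge([at_b, at_c])  # step 6 at the initiator
```

In this small case both echos carry the edge (b,c), and the initiator's merge drops the expensive edge (a,c).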

At no point is any edge permanently accepted. However, each intermediate node
eliminates those edges which would not have been candidates at a higher level anyway. This is so
because  within  this  group  of  nodes,  a  minimum  spanning  tree  has  been  found,  so  the  only 
edges  eliminated  are  those  which  interconnect  this  group  of  nodes.  Edges  connecting 
different  groups  are  considered  at  higher  levels  of  the  execution  tree. 

The fact that explorers can travel in two directions on some edges means that the same
edge can appear in different sub-trees. However, the total number of edges carried by any
one message can be at most e, the total number of edges in the graph. In fact, the
sum of the edges carried by echos in any one level is no more than 2e, for at the highest
level, at worst no edges have been eliminated, and all edges are represented by echos going
to the initiator. Thus, the total number of edges carried by all echos in an execution of the
algorithm is bounded by 2de, where e is defined above, and d is the maximum depth of the
tree. Finally, the elapsed communication time for the algorithm, not including processing
cost, is simply the time required to do a traversal algorithm, which is 2D, where D is the
diameter of the graph.

In terms of the behaviour of the algorithms in this chapter in the face of failures of
nodes or links, the comments on reliability in Chapter 4 are entirely applicable, since these
are single-source echo algorithms.

Chapter 6


Directed  Graphs  and  Echo  Algorithms 


6.1.  Introduction 

This  chapter  is  a  consideration  of  echo  algorithms  in  the  environment  of  directed  graphs. 
Let  a  directed  edge  represent  a  particular  semantic  relationship  between  two  nodes,  which 
we  do  not  need  to  specify  to  appreciate  the  nature  of  echo  algorithms  for  such  graphs.  We 
maintain, however, the premise that each edge is bidirectional in terms of supporting
communications between nodes. We further assume that each node stores all its edge
relationships, both in-edges and out-edges. This is not unrealistic in a distributed environment
in which an undirected graph would be implemented by every node maintaining all its
adjacencies, and a directed edge needing only one more bit.

In  this  chapter,  we  will  present  an  echo  algorithm  for  detecting  the  strongly  connected 
components  of  a  directed  graph,  and  a  special  case  of  this  algorithm  which  detects  whether 
or  not  a  directed  graph  has  the  particular  configuration  called  a  knot.  We  then  show  how 
the knot detection algorithm can be applied to the problem of detecting deadlock in a
distributed system of processes and resources whose interactions can be modelled by a directed
graph. 

6.2.  Strongly  Connected  Components 

First we consider the algorithm for finding the strongly connected components of a
directed graph. Given a directed graph G = (V,E), the strongly connected components
G_i = (V_i,E_i) are those vertices and edges which satisfy the relation, for each such
component, that if v, w are two nodes in the component, then there are directed paths both from
v to w and from w to v. The edges connecting all such nodes for a particular V_i are
members of E_i. Tarjan [TARJ 72] has presented a sequential method for finding all strongly
connected components which depends on a depth-first search of the graph. Since the
technique of echo algorithms is a depth-parallel traversal, we should be able to find an efficient
echo algorithm to accomplish the same purpose.

Our method consists of first tagging all nodes which can reach a particular initiator node
via directed paths, then finding those nodes which are reachable from that initiator node.
All such reachable nodes which have been tagged are members of the strong component which
includes the initiator, and all edges which reach tagged nodes are also in the strong
component. We need, in addition, several other mechanisms in order to execute this
algorithm for the entire graph G.
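
The idea can be illustrated sequentially. The following Python sketch is a centralized
stand-in and not the echo algorithm itself: it assumes the whole graph is available in one
place as a dictionary `adj` mapping each node to a list of its out-neighbours, and it ignores
the rule about nodes already marked in another component. The names `reachable`, `reverse`
and `strong_component` are ours.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from start (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def reverse(adj):
    """Return the graph with every edge reversed."""
    rev = {u: [] for u in adj}
    for u, outs in adj.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev


def strong_component(adj, s):
    """Nodes that can reach s and are also reachable from s."""
    tagged = reachable(reverse(adj), s)   # backward pass: nodes that reach s
    forward = reachable(adj, s)           # forward pass: nodes s can reach
    return tagged & forward
```

For the graph a -> b -> c -> a with an extra edge c -> d, the component of a is {a, b, c},
and d forms a component by itself.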

First consider any initiator node S. The strong component algorithm, which we call
Algorithm SCA, consists of two echo algorithms. The first, which we call Algorithm SCA-B, is
a backwards algorithm, and tags all nodes which can reach S, unless they are already
members of a strong component, with the name S. The second echo algorithm is called
Algorithm SCA-F, and is a forward algorithm, which finds the nodes reachable from S. All
such nodes which have been tagged with S, and the edges on which they were reached from
S, are included in the strong component of S, which we call SC-S. These nodes are marked
with the name S, to show that they are members of SC-S. An explorer which reaches a
marked node terminates and becomes an echo.

The other mechanism we need is a sequential method of ensuring that every node in turn
executes Algorithm SCA. This algorithm, which we call SEQUENCE, lets each node which
has not been marked as a member of a strong component execute Algorithm SCA. Nodes
which have been marked are bypassed. We have already presented an algorithm which
returns a sorted list of nodes to an initiator node. By using either a spanning tree or a
pure traversal algorithm, the initiator can sequence through each node in turn.

The strong components algorithm thus consists of Algorithm SEQUENCE, which executes
Algorithm SCA for nodes not in a strong component. Algorithm SCA in turn consists of
the two echo algorithms, one for tagging reachability, and the other for marking members of
the strong components.

Algorithm  SEQUENCE 

1.  For  each  candidate  initiator  node  S,  if  it  has  not  already  been  marked  as  a  member  of 
a  strong  component,  execute  Algorithm  SCA. 

2. If S is already marked, then consider the next candidate. If there are no further
candidates, then the algorithm is done.

Algorithm  SCA 

1.  Execute,  for  initiator  S,  Algorithm  SCA-B. 

2.  After  S  becomes  aware  of  the  termination  of  Algorithm  SCA-B,  execute  Algorithm 
SCA-F. 
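
The control structure of SEQUENCE and SCA can be sketched as follows. This is a sequential
approximation under our own assumptions: the two echo passes are replaced by ordinary graph
searches, the graph is given as a dictionary `out_edges`, and `order` is the list of
candidate initiators (it must contain every node). The helper names `search` and `sequence`
are hypothetical.

```python
def search(adj, start, allowed):
    """Nodes reachable from start along adj, entering only nodes in allowed."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sequence(out_edges, order):
    """Return a dict mapping each node to the initiator naming its component."""
    in_edges = {u: [] for u in out_edges}
    for u, outs in out_edges.items():
        for v in outs:
            in_edges.setdefault(v, []).append(u)

    marked = {}
    for s in order:
        if s in marked:                      # SEQUENCE step 2: bypass marked nodes
            continue
        unmarked = {v for v in order if v not in marked}
        tagged = search(in_edges, s, unmarked)    # stands in for SCA-B
        members = search(out_edges, s, tagged)    # stands in for SCA-F
        for v in members:
            marked[v] = s
    return marked
```

With the edges a -> b, b -> a, b -> c, c -> d, d -> c and the order a, b, c, d, nodes a and b
are marked a, node b is then bypassed, and nodes c and d are marked c.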

Algorithm  SCA-B  Tagging  Algorithm  for  S 

1. From S, send explorers on the in-edges of S. If there are no in-edges, then we are done.

2.  Consider  a  node  i  on  which  an  explorer  from  node  j  arrives.  If  i  has  been  marked  as  a 
member  of  a  strong  component,  then  send  an  echo  back  to  j. 

3. If i has not been marked, and has not been visited by an explorer from S, send
explorers in parallel on the in-edges of i and tag node i with S. If i is a sink node of
the reversed graph (no in-edges), then send an echo to node j. If i has been visited by an
explorer from S, then send an echo back to node j.

4.  Echos  are  treated  as  in  any  echo  algorithm.  A  node  waits  for  all  echos  to  return  from 
edges  on  which  explorers  were  issued,  and  then  itself  echos  along  its  first  edge. 

5.  When  the  initiator  node  gets  echos  back  along  all  its  in-edges,  then  Algorithm  SCA-B 
is  done. 
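
The steps above can be simulated with an explicit message queue. The sketch below is
illustrative only: it delivers messages one at a time from a single FIFO queue, whereas the
real algorithm runs in parallel with arbitrary delays, so which edge becomes a node's first
edge may differ. The function name `sca_b`, the `in_edges` dictionary and the `marked` set
are our assumptions.

```python
from collections import deque


def sca_b(in_edges, s, marked=frozenset()):
    """Simulate the backward tagging pass from s.

    in_edges maps each node to the list of nodes with an edge into it.
    Returns the set of tagged nodes and the number of messages delivered.
    """
    tagged = {s}                       # the initiator counts as visited
    first = {s: None}                  # sender of each node's first explorer
    pending = {s: len(in_edges.get(s, ()))}
    queue = deque(("explorer", s, p) for p in in_edges.get(s, ()))
    messages = 0
    while queue:
        kind, src, dst = queue.popleft()
        messages += 1
        if kind == "explorer":
            if dst in marked or dst in tagged:
                queue.append(("echo", dst, src))      # steps 2 and 3: echo back
                continue
            tagged.add(dst)
            first[dst] = src
            preds = in_edges.get(dst, ())
            pending[dst] = len(preds)
            if not preds:
                queue.append(("echo", dst, src))      # no in-edges: echo at once
            for p in preds:
                queue.append(("explorer", dst, p))
        else:                                          # an echo arrives at dst
            pending[dst] -= 1
            if pending[dst] == 0 and first[dst] is not None:
                queue.append(("echo", dst, first[dst]))   # step 4
    return tagged, messages
```

In this simulation every in-edge of a tagged node carries exactly one explorer and one echo,
which is the behaviour expected of a simple traversal echo algorithm.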

Clearly, Algorithm SCA-B is a simple traversal echo algorithm, run on the reversed image of
the graph, treating in-edges as out-edges and vice versa. Each node reached by Algorithm
SCA-B is a node with a directed path to S in the graph G.

Algorithm SCA-F Forward Marking Algorithm

1. Run a normal pure traversal algorithm from S, in which explorers follow only out-edges.

2.  If  an  explorer  comes  to  a  node  which  is  not  tagged  S,  then  send  an  echo  back  on  the 
edge  of  arrival  of  the  explorer. 

3.  An  explorer  which  comes  to  a  node  which  has  already  been  marked  as  a  member  of  a 
strong  component  sends  an  echo  back. 

4. An explorer which comes to a node which has not been marked at all, but has been
tagged S, marks it with the name S, as a member of the strong component of S,
and also marks the edge on which it arrived as being a member of the strong
component of S.

5. Each echo carries a pair (V, E). A node waits for all its echos to arrive, and then forms
the new pair (V, E) consisting of the union of all the V_i, E_i. If the node is marked S, it
includes itself, and its first edge, in the new pair. It then echos to its predecessor. An
echo created by the termination of an explorer will have the empty pair.

6.  If  a  node  has  received  all  its  echos,  created  its  new  pair,  and  finds  it  is  the  initiator 
node  S,  then  it  has  found  all  the  members  of  the  strong  component  SC-S. 
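
A compact way to see how the (V, E) pairs accumulate is a recursive sketch in which each
call plays the role of one explorer and its return value plays the role of the echo. This is
a sequentialized illustration, not the parallel algorithm: recursion fixes the order in which
nodes are first reached, and only first edges are collected, as in step 5. The names `sca_f`,
`tagged` and `marked` are ours, and S is assumed to be tagged and unmarked on entry.

```python
def sca_f(out_edges, s, tagged, marked):
    """Return the pair (V, E) for the strong component of s.

    tagged: set of nodes tagged by the backward pass from s.
    marked: dict node -> component name, updated in place.
    """
    def explore(node, edge):
        # Steps 2 and 3: untagged or already marked nodes give an empty echo.
        if node not in tagged or node in marked:
            return set(), set()
        marked[node] = s                              # step 4: mark the node
        nodes = {node}
        edges = {edge} if edge is not None else set()
        for nxt in out_edges.get(node, ()):
            v, e = explore(nxt, (node, nxt))
            nodes |= v                                # step 5: union of echoed pairs
            edges |= e
        return nodes, edges

    return explore(s, None)
```

For the graph a -> b -> c -> a with an extra edge c -> d and tagged set {a, b, c}, the
initiator a ends with V = {a, b, c} and E = {(a, b), (b, c)}.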

Several comments are in order at this point. If an explorer in Algorithm SCA-F reaches
a sink node, that node could not have been tagged, by definition. Also, a node which receives
non-null pairs (V, E) in one or more of its echos must itself be tagged S. For if it were
not, then it could not reach S; but then neither could any of its descendants in G.
Therefore they could not possibly be members of the strong component SC-S. Figure 6.1 is an
example illustrating the detection of the strong components of the graph G using this method.

6.2.1  Behaviour 

Clearly, Algorithms SCA-B and SCA-F are the operational parts of the strongly connected
components algorithm. Consider Algorithm SCA-F first. It is a forward traversal of the graph
reachable from an initiator node S, but is limited by nodes which are either not tagged S,
or which have been marked as members of other strong components. Thus, no strong
component, once identified, is traversed again. Algorithm SCA-F can be thought of as the
cumulative traversal of sub-graphs which have no overlap, though they may share borders.
The number of message passes in all executions of SCA-F for one execution of Algorithm
SEQUENCE is simply the number of edges in the graph, each traversed once by an explorer and
once by an echo. This is 2e.

The number of messages for Algorithm SCA-B is more difficult to bound. Consider a sub-graph
which has a directed path to S but is not reachable from S. Then Algorithm SCA-B will
send explorers into this sub-graph. Furthermore, there may be a sequence of such nodes
which are reachable from the sub-graph, but cannot reach it. Figure 6.2 is an illustration of
this situation. Then each invocation of Algorithm SCA-B may send explorers into this
sub-graph. By modifying Algorithm SCA-B we can avoid this redundancy.

6.2.2.  Optimized  Version  of  Algorithm  SCA 

Let the echos in Algorithm SCA-B bring back the nodes visited by explorers, so that the
initiator, at the end of Algorithm SCA-B, has a set of nodes, those which can reach it from
directed paths (and not already marked in a strong component). Call this the R-from set of
S, denoted {R}. Every node in the sub-tree of the initiator will also compute its own R-from
set. For every new invocation of Algorithm SCA-B, then, only those nodes which have not
yet found what nodes reach them need send explorers on their in-edges. An explorer coming
to a node which knows what its R-from set is can echo immediately with its R-from set ∪
the node name. Obviously, if node v can be reached by {R}, and w can be reached from v,
then w can be reached from {R} ∪ {v}.

Figure 6.2 An illustration of repeated traversals of a sub-graph in Algorithm SCA-B
(Consider Algorithm SCA-B run successively from nodes a, b, c, d in that order.)
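
The saving can be pictured as memoization. In the following sketch, offered under our own
simplifying assumptions, a dictionary `cache` holds the R-from sets already known, only the
initiator's set is stored after each run, and the exclusion of nodes already marked in a
strong component is left out. The function name `r_from` is hypothetical.

```python
def r_from(in_edges, s, cache):
    """Return the set of nodes that can reach s, reusing known R-from sets."""
    result = set()
    stack = list(in_edges.get(s, ()))
    while stack:
        u = stack.pop()
        if u in result:
            continue
        result.add(u)
        if u in cache:
            # u already knows its R-from set: take it over, do not explore further
            result |= cache[u]
        else:
            stack.extend(in_edges.get(u, ()))
    cache[s] = result
    return result
```

On the chain a -> b -> c -> d, once the set for b is known, a later run from d stops at b and
takes over {a} without revisiting a.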

Now Algorithm SCA-F must be changed. A node which is reachable from S, and which
can reach S, is in the strong component of S. Therefore explorers from S, in Algorithm
SCA-F, must carry the R-from set of S. If an explorer comes to a node which is in the
R-from set of S, then this node is a member of the strong component SC-S.

Let us call this the optimized version of Algorithm SCA, and the original the tagging
version. Given these modifications, the number of message passes for Algorithm SCA-F has
not changed. It is still, for all executions of SCA-F in one execution of SEQUENCE, 2e,
where e is the number of edges in the graph. However, the size of the explorer message is
now at most n log n bits, since each node description takes log n bits, and there are n
nodes. For the SCA-B algorithm, each edge of the graph need only be traversed once in each
direction, since once the R-from set of a particular node is found, it does not change. At
each node, however, at most n node names must be stored, which takes n log n bits. Thus, we
have a trade-off of storage and message size, to avoid potentially retracing sub-graphs in
Algorithm SCA-B.
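
As a rough numerical illustration of the n log n bound, the sketch below assumes that node
names are encoded as fixed-width binary integers, so that log n is rounded up to a whole
number of bits. The function name `name_list_bits` is ours.

```python
def name_list_bits(n):
    """Upper bound, in bits, on a list of up to n node names."""
    bits_per_name = max(1, (n - 1).bit_length())   # ceil(log2 n) for n >= 2
    return n * bits_per_name
```

For a graph of 1024 nodes this gives 10 bits per name and 10240 bits for a full R-from set.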

The time taken is, using the modifications above, essentially that of a sequence of parallel
traversals of pieces of the graph. There is no overlap of pieces, however, so that the time
is bounded by O(n), where n is the number of nodes in the graph. In other words, it is no
worse than if all the nodes are visited in sequence, which Algorithm SEQUENCE must
do anyway.

During  an  execution  of  Algorithm  SEQUENCE,  the  graph  may  change  if  a  node  fails. 
Then  any  prior  strong  component  information  may  be  in  error,  and  in  addition,  parts  of  the 
graph  which  have  been  tagged  may  become  unreachable.  Thus,  failure  of  a  node  must,  when 
detected  during  execution,  lead  to  ABORT,  a  re-organization  of  adjacency  data,  and  then  a 
restart  of  the  algorithm.  Since  we  assume  bidirectional  communications  on  all  links,  we  can 
use  the  ABORT  technique  of  Chapter  4. 

6.3. Detection of a Knot

Holt [HOLT 72] described a knot in a directed graph G as a set of nodes in which every
node is reachable from every other node. More precisely, if there is a directed path from
node v to w, then w is a descendant of v. A knot is a set of nodes in which the set of
descendants of every node is the set of nodes itself. Thus, we first want to know what nodes
reach an initiator node S. Secondly, we want to know what nodes S can reach. For a strong
component, all nodes S can reach which also reach S are members of the strong component
of S. However, in a knot, all of the nodes which S can reach must also reach S, or else the
initiator is not a member of a knot.

The algorithm for detecting a knot is a simple variant of the algorithm for detecting
strong components. First of all, we do not need Algorithm SEQUENCE. We are
interested in finding out whether a particular initiator node is a member of a knot. Secondly,
Algorithm SCA-F is modified so that if any node is reached which does not
reach S, then we know we have violated a necessary condition for a knot. Hence, an echo
bearing a negative signal can be generated, which will cause further negative echos to be sent
by each intermediate node. Finally, if all explorers in Algorithm SCA-F reach nodes which
can reach S, all echos generated will be normal, and S will eventually know that it is a
member of a knot.
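
The condition being tested can be stated in a few lines. The sketch below is a centralized
check under our assumptions, not the echo algorithm: the graph is a dictionary `out_edges`,
and we additionally require that S has at least one descendant, so that an isolated sink is
not reported as a knot. The names `descendants` and `in_knot` are hypothetical.

```python
def descendants(adj, start):
    """Nodes reachable from start by a path of at least one edge."""
    seen = set()
    stack = list(adj.get(start, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj.get(u, ()))
    return seen


def in_knot(out_edges, s):
    """True if every node s can reach can also reach s (and s reaches something)."""
    in_edges = {}
    for u, outs in out_edges.items():
        for v in outs:
            in_edges.setdefault(v, []).append(u)
    can_reach = descendants(out_edges, s)     # forward pass
    reaches_s = descendants(in_edges, s)      # backward pass
    return bool(can_reach) and can_reach <= reaches_s
```

A node that merely leads into a knot, or a node that can reach a sink, fails the test.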

It does not matter which version of Algorithm SCA we use. The original tagging version
clearly will identify whether or not the condition for a knot is present. The optimized
algorithm, which depends on finding R-from sets of nodes, will function equally well.

Since knot detection can be thought of as a one-time event, the arguments made for
minimizing the number of traversals through a sub-graph are not quite as relevant. Thus, if
the tagging algorithm is used, the number of messages is O(e), where e is the
number of edges in the graph, and the time is O(D), where D is the diameter of the graph.
Since there is a backward and a forward echo algorithm, the time is bounded by 4D, and
the number of messages by 4e.

6.3.1. One-pass Algorithm

It is reasonable to ask whether all the information required to detect membership in a
knot can be obtained in a single pass of a traversal, since the elapsed time would then
only be 2D. We have discovered such an algorithm, but it is certainly more complex than
the rather elegant approach given above. It is described in some detail in Appendix III, and
we will only give a few brief comments here.

A node S is not a member of a knot if it can reach a sink node, or a group of nodes
which have no sinks but also have no paths leading to S, which we call an extended sink.
Figure 6.3 illustrates this concept. If we could identify an extended sink, we could treat it
simply as one large sink node. Then the presence of such a node would negate the
membership of S in a knot.

We want to use an echo algorithm, which is based on the technique of explorers turning
into echos at sinks or visited nodes. In this method, an echo would carry a status of sink,
visit or conn, where conn indicates that the explorer connected to the initiator node.
Clearly S is a member of a knot if all its echos return with status conn. An echo of sink
would always dominate the merging of echos, and give the new echo a status of sink, for
obvious reasons. The problem, then, is to identify extended sinks using an echo algorithm.

The problem is made difficult by the fact that the explorers in extended sinks must stop
at visited nodes, and therefore the node with an edge leading into an extended sink gets back
an echo with status visit. However, a node which sends explorers to nodes already visited,
which nevertheless lead to S, also gets back a status of visit from all its explorers. If we
retain the technique of echoing, then to distinguish the two cases, we must do it by default.
In other words, we identify the latter case, while trusting that extended sinks send no
further echos back. We must require nodes to send echos along all in-edges, rather than just
the first edge, and furthermore, if a new echo comes to a node as a result, it must do further
echo-merging, until the system stabilizes.

This is still not sufficient, for a node with one edge to an extended sink and another to S
gets back the echos visit and conn. Thus, this node must echo the compound status vis-con.
By a complex series of case arguments, it is possible to show that by sending new echos on
non-first in-edges, by using a complex echo merge scheme, and finally by counting on
extended sinks not to produce new echos, membership in a knot can always be identified.
We were able to bound the number of statuses for echos to four, and to limit the number of
iterations of echo-merging to three.

6.3.2. Dijkstra’s Approach

The awkwardness of the one-pass approach led Dijkstra [DIJK 79a] to produce a solution
to the knot detection problem, which in turn led us to the present two-pass echo
algorithm. The approach which he took is essentially in three phases. The first establishes
the truth of the predicate Q(1): node X is reachable from node D, where D is the initiator
node. The second phase establishes for each node the truth of the predicate Q(2): node D is
reachable from node X. The third would essentially establish that node D is reachable from
all nodes which node D reaches. The presentation of this solution provides an interesting
exercise in the use of guarded commands for this “diffusion computation” [DIJK 78]. However,
his solution in its present form does not answer no if a knot is not present.

6.4.  Deadlock 

Now we will look at the application of the knot detection algorithm to the decentralized
detection of deadlock in distributed systems. Following Holt [HOLT 72], process-resource
interactions can be represented by a bipartite directed graph <P,R,E> where P is a set of
processes, R is a set of resources and E is a set of directed edges of the form [p,r] or [r,p].
Process p owning resource r is represented by an [r,p] edge, directed from r to p, while p
waiting for r is represented by a directed [p,r] edge. Assume that each resource is composed
of several units, and that a process may simultaneously request several resources in arbitrary
units up to the maximum available for a resource. A granted request will cause an [r,p]
edge to be formed, while an unavailable resource will generate a [p,r] edge. Assume that the
formation of edges is instantaneous.

A process p which has one or more [p,r] edges is in the wait state, and is said to be
blocked. It cannot make further requests, nor release its resources; it can only wait for
the resources it has requested. A node p which only has [r,p] edges has all its needed
resources and is in a running state, during which it may make further requests, release one
or more resources, or release all its resources and terminate. Nodes which have no edges are
inactive, and we consider them not to be in the graph. Let us define a system graph to be the
<P,R,E> graph of all the processes and resources of a system, together with the edges which
represent their interactions. Clearly a system graph can consist of several disjoint connected
sub-graphs. We define a process-resource graph, or a P-R graph, to be such a connected
sub-graph of a system graph. Note that this excludes the trivial sub-graphs consisting of
single unconnected nodes.
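
The edge conventions translate directly into a small classification routine. The sketch
below assumes, purely for illustration, that the edge set E is held in one place as a Python
set of (source, destination) pairs; in the distributed setting each node would inspect only
its own in-edges and out-edges. The name `process_state` is ours.

```python
def process_state(edges, p):
    """Classify process p from the edges touching it.

    edges: set of (src, dst) pairs; (p, r) means p waits for r,
    and (r, p) means p owns units of r.
    """
    waits = any(src == p for src, _ in edges)
    owns = any(dst == p for _, dst in edges)
    if waits:
        return "blocked"      # one or more [p,r] edges
    if owns:
        return "running"      # only [r,p] edges
    return "inactive"         # no edges: not part of the graph
```

For example, with edges [r1,p], [p,r2] and [r2,q], process p is blocked while q is running.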

We assume that running processes terminate after a finite period, and that they release
their resources. These resources are then given to waiting processes, with the appropriate
edge formations made in the P-R graph. Thus, the model is a general one in which
resources are allocated whenever possible, and processes wait for unavailable resources. In
such an environment, deadlock, or permanent blocking, as Holt described [HOLT 72], will
occur among the nodes of a knot. A process is thus permanently blocked, or in a deadlock,
if it is a member of a knot. Conversely, a process node is not deadlocked if it is connected
to a sink node, a node which has no out-edges. It is easy to see that this must be a process
node, for resource sink nodes would represent unavailable resources not owned by any
process. A P-R graph which does not contain a knot is a safe P-R graph. Figure 6.4
illustrates a P-R graph which is deadlocked. The node S is an initiator node, which will
execute the knot detection algorithm.

There is a distinction between whether a P-R graph is a knot, and whether a P-R graph
contains a knot. Certainly, if a P-R graph is a knot, then all nodes in the graph are
deadlocked. Thus, a knot is a sufficient condition for deadlock. If a P-R graph contains a
knot, then the nodes which are not members of the knot must be nodes leading to the knot.
This is so, for by definition, a knot does not have edges leading to nodes which are not
members of the knot. A process node with an edge leading into a resource node in the knot
is waiting for that resource, and hence must wait until the knot is resolved. A resource node
with an edge leading to a process node in the knot is owned by that process, and those
resources cannot be available until the deadlock has been resolved. Figure 6.5 illustrates
this situation. We will show, however, that a knot cannot occur without one of its members
undergoing a state transition which will cause it to execute a detection algorithm.
Therefore, all deadlocks represented by knots will be detected by one or more nodes in these
knots. Any node in a P-R graph which leads to a knot can therefore treat the knot as an
extended sink, and from its point of view, it is not in a deadlock. If it waits long enough,
the members of the sink will resolve their own deadlock. Therefore, only nodes which are
members of knots are deadlocked, given a strictly decentralized point of view.

Assume a network of autonomous processes and resources, each with computing capability
and storage space, interconnected by communications links. Process-resource interactions
will take place via these links. No node maintains a global P-R graph. Instead, each node
stores that part of the P-R graph which involves it, by describing the in-edges and out-edges
which represent interactions with its neighbours. Thus, the P-R graph is stored in a
distributed fashion. This model is exactly what we have used for the knot detection
algorithm, except that it is bipartite. However, for purposes of knot detection, it matters
not at all that there are two types of nodes. Since the graph configuration is the
characteristic of interest, let us simply treat this system as a directed graph of one node
type.

6.4.1.  Procedure  Detect 

We now discuss the use of the knot detection algorithm in a process-resource environment
as described above. First we assume that resources have sufficient intelligence so that
we can ignore the bipartite nature of P-R graphs, and treat all nodes equally. This is with
the proviso that a directed edge from a process to a sink resource node is illegal, for it
would represent a process blocked on an available resource.

We call the use of the knot detection algorithm in this environment an invocation of the
procedure Detect. Process-resource activity goes on unless deadlock is present. Therefore a
process initiating Procedure Detect may find that the graph changes while Detect is
executing. If an edge is removed, no echo will be received from it. If an edge is created at a
node after the node has been visited already, no echo would ever come on that edge. Thus,
since echo algorithms depend on all echos returning, a particular execution of Procedure
Detect may not return no if the graph is changing. Certainly, under these conditions of
change, it would never return yes in error. The purpose of executing Detect is to discover if
deadlock is present. If deadlock is not present, it does not really matter if no answer is
obtained, as long as we can be sure that in case of deadlock, Detect will always return yes.
The point is that if no answer is obtained, but deadlock is absent, then process-resource
activity is going on, and eventually the initiator process will get its needed resources, and
execute. Barring failures, therefore, it either obtains its resources and runs, or discovers
that a deadlock exists. Practically, a process may have to decide that a long wait is due to
the failure of a node, rather than the absence of deadlock. This issue is addressed in
Section 6.4.2, on reliability.

Figure 6.4 A process-resource graph illustrating a knot and deadlock

The  legal  states  of  a  process  are:  running  (R)  and  blocked  (B);  for  a  resource,  they  are 
available  (A)  and  unavailable  (U).  In  the  interactions  between  processes  and  resources,  the 
legal  state  changes  of  a  process  are:  requesting  a  resource  (R  to  R  or  B),  acquiring  a 
resource  (R  to  R;  B  to  R  or  B),  releasing  a  resource  (R  to  R),  and  terminating.  Legal  state 
changes  of  a  resource  are:  allocating  a  unit  to  a  process  (A  to  A  or  U),  acquiring  a  unit 
from  a  process  (A  to  A;  U  to  A).  These  interactions  are  called  process-resource  (P-R)  state 
changes. 
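
The legal state changes listed above can be collected into a lookup table. The sketch below
encodes only the transitions named in the text; termination is left out because no
resulting state is given for it. The event labels and the names `LEGAL` and `is_legal` are
our own.

```python
# (node kind, event) -> set of allowed (state before, state after) pairs
LEGAL = {
    ("process", "request"): {("R", "R"), ("R", "B")},
    ("process", "acquire"): {("R", "R"), ("B", "R"), ("B", "B")},
    ("process", "release"): {("R", "R")},
    ("resource", "allocate"): {("A", "A"), ("A", "U")},
    ("resource", "acquire"): {("A", "A"), ("U", "A")},
}


def is_legal(kind, event, before, after):
    """True if the state change is one of the legal P-R state changes."""
    return (before, after) in LEGAL.get((kind, event), set())
```

A blocked process releasing a resource, for instance, is rejected, in keeping with the rule
that a blocked process can only wait.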

We define a trigger condition to be a process node becoming blocked following a
request, or a process node which, having been blocked and having obtained a resource, finds
it is still blocked. Thus, it is some state change in a process node which leaves it blocked.
Any process node which detects a trigger condition will cause an execution of Detect to
begin, with itself as the initiator node.
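
Expressed as a predicate over a single state change, the definition might look as follows.
This is a sketch using our own event labels ("request" and "acquire") and the state letters
R and B introduced earlier; the name `is_trigger` is hypothetical.

```python
def is_trigger(event, before, after):
    """True if a process state change leaves the process blocked."""
    if after != "B":
        return False
    became_blocked = event == "request" and before == "R"
    still_blocked = event == "acquire" and before == "B"
    return became_blocked or still_blocked
```

A process for which this predicate holds would start Detect with itself as initiator.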

We  further  stipulate  that  any  node  which  undergoes  a  state  change  will  immediately 
erase  all  its  markings  for  any  current  executions  of  Detect.  Note  that  this  means  that  if 
some  initiator  has  already  sent  an  explorer  to  the  node,  but  there  is  some  other  in-edge  yet 
to  receive  a  subsequent  explorer  from  the  same  initiator,  this  next  explorer  would  be  treated 
as  a  first  explorer.  However,  the  original  first  edge  would  never  receive  an  echo  from  this 
node.  Thus,  the  execution  of  that  Detect  is  effectively  prevented  from  completion. 
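
The effect of erasing markings can be seen in a toy model of a single node. The class below
is an illustrative assumption of ours, not part of Procedure Detect as specified: it records,
per initiator, the edge on which the first explorer arrived, and forgets everything on a
state change.

```python
class Node:
    """Toy node that remembers the first edge for each initiator."""

    def __init__(self):
        self.first_edge = {}          # initiator -> edge of first explorer

    def on_explorer(self, initiator, edge):
        if initiator not in self.first_edge:
            self.first_edge[initiator] = edge
            return "forward"          # first visit: send explorers onward
        return "echo"                 # already visited: echo at once

    def on_state_change(self):
        self.first_edge.clear()       # erase markings for all executions
```

After a state change, an explorer arriving on a second edge is treated as a first explorer,
and the edge recorded earlier is forgotten, so no echo is ever sent on it.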

Proposition  6.1.  A  necessary  condition  for  the  formation  of  deadlock  in  a  P-R  graph  is  a 
P-R  state  change  which  produces  a  trigger  condition. 

Argument:  The  occurrence  of  deadlock  is  equivalent  to  the  formation  of  a  knot.  There 
are  only  three  legal  edge  transitions  for  P-R  graphs:  edge  formation,  edge  elimination,  and 
edge  reversal.  We  argue  that  if  the  P-R  graph  formed  through  any  of  these  is  a  knot,  then  a 
trigger  condition  must  have  been  induced. 

Consider edge formation in Figure 6.6. Squares represent processes, and circles
resources. Clearly process p, in requesting resource r, will become blocked if resource r has
only one unit, already held by process q. Hence p induces a trigger condition. Next consider
edge elimination. In Figure 6.7, when p releases r1, q is still blocked. This is also a trigger
condition. Finally, consider edge reversal. In Figure 6.8, p releases r1, which q acquires.
However, q is still blocked. Again, this induces a trigger condition.

There are only these three ways of forming knots in P-R graphs which do not contain
knots. Thus, the formation of a knot necessarily includes the induction of a trigger
condition. □

Figure 6.7

Figure 6.8

Proposition 6.2. In a distributed P-R system, a necessary and sufficient condition for the
detection of deadlock is an execution of Procedure Detect which returns yes as an answer.

Argument: First consider necessity. By Proposition 6.1, deadlock cannot
occur without inducing a trigger condition, which causes Detect to be executed. Now we
show sufficiency. Detect is a knot detecting algorithm. When it returns yes, all nodes
reached from the initiator node are nodes which also reach the initiator node. By definition,
the presence of a knot in the P-R graph is equivalent to the presence of deadlock. Hence,
when Detect returns yes, deadlock is present. □

Note  that  it  is  necessary  for  all  process  nodes  to  be  able  to  detect  trigger  conditions  in 
order  for  Proposition  6.2  to  hold.  For  if  only  a  subset  of  process  nodes  can  detect  trigger 
conditions,  clearly  a  knot  can  be  formed  through  the  legal  transition  of  a  process  node 
which  does  not  have  this  capability.  This  knot  will  therefore  not  be  detected. 

6.4.2.  Reliability  Issues 

Let  us  consider  the  function  of  Detect.  If  the  initiator  is  a  member  of  a  knot,  it  must  get 
an  answer  of  yes.  If  not,  then  it  should  not  get  yes.  In  the  event  of  node  or  line  failures,  we 
must  decide  whether  this  means  that  the  P-R  graph  changes,  and  a  knot  becomes  no  longer 
a  knot,  or  that  a  P-R  graph  does  become  a  knot. 

Let us first assume a soft-fail environment, in which nodes that fail do so temporarily.
It would make no sense in this case to reclaim for the network any resources that were held
by a failed node, for then the failed node would have to restart from scratch, rather than at
the point of failure. This is equivalent to permanent failure. Thus, we must assume that in a
soft-fail environment, failure does not mean changing the configuration of the P-R graph.

In such an environment, if a node fails during the execution of Detect, then either an
explorer  cannot  be  sent  out  or  an  echo  cannot  be  sent  back.  In  either  case,  it  is  not  possible 
for  the  detector  to  get  an  answer  of  yes.  If  a  knot  is  present,  it  will  be  detected  after  the 
failed  node  recovers.  If  the  graph  is  safe,  then  no  false  answer  will  be  given  to  the  detector 
node. 

Let  us  assume  now  that  failure  means  that  the  nodes  involved  drop  out  permanently. 
How does this affect Detect? If a node fails before it has received its explorer, its sending
node(s)  will  not  be  able  to  transmit  to  it.  If  these  are  resource  nodes,  then  they  can  erase 
these  out-edges,  because  it  would  mean  that  the  process  that  failed  no  longer  holds  any  units 
of  that  resource.  However,  if  the  node  which  failed  is  a  resource  node,  then  nodes  which 
have  out-edges  to  it  are  process  nodes  which  must  have  that  resource  in  order  to  run.  Thus, 
these  process  nodes  fail  as  well,  and  must  release  their  resources.  In  the  case  that  an 
explorer  cannot  be  sent  to  a  failed  node,  then,  resources  will  be  made  available,  P-R  state 
changes  will  occur,  and  we  don’t  care  about  this  execution  of  Procedure  Detect  any  more. 

If  nodes  fail  after  they  have  received  their  explorers,  then  the  nodes  trying  to  echo  to 
them  will  not  be  able  to  echo.  If  these  are  process  nodes,  it  means  that  the  resources  they 
held  have  vanished.  Thus  they  must  terminate  themselves  and  release  any  other  resources 
they  hold,  in  addition  to  relinquishing  their  requests.  This  causes  legal  P-R  state  changes. 
If the failed node is a process node, then the node(s) trying to send it echoes are resource
node(s). These will know that the failed process no longer requires the resource(s). However,
this does not cause any P-R state changes. The nodes which are waiting for an echo
from the failed process node will never get one, but since we do not assume a mechanism by
which waiting nodes test to see if the other end is still functioning, these nodes will wait
forever.

This  means  that  if  a  process  node  were  to  fail  such  that  it  is  considered  removed,  and 
the  failure  occurred  after  it  forwarded  its  explorers,  then  Detect  would  wait  forever.  If  the 
failure  caused  a  knot  to  form,  that  knot  would  not  be  detected;  if  the  failure  occurred  in  a 
knot,  neither  would  the  knot  be  found,  nor  would  the  transformation  into  a  safe  P-R  graph 
be  detected. 

We draw the following conclusions. In a fail-soft environment, in which failed nodes
resume at their point of interruption, Procedure Detect is robust. In a fail-hard environment,
in which failed nodes disappear forever, each process node must have a reassurance
mechanism by which it intermittently lets the resources which it holds know that it is still
alive and functioning. If it fails, then the resources stop getting their reassurances, and either
detect that the process has failed, or assume it has failed, reclaim their resources and erase
the (r,p) edges in the P-R graph. The reclamation of resources will cause legal process-resource
state transitions to occur. Equivalent to reassurance is a time-out mechanism at a
waiting node, which after timing out sends an “OK?” query. As long as it gets the right
response, it continues to wait. If it cannot communicate or gets no response back, it has
detected failure.
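
The time-out variant can be sketched as a simple loop at the waiting node. This is only an illustration under stated assumptions: `echo_arrived` and `probe` are hypothetical callables standing in for the node's message interface, and `max_probes` is an artificial bound added so that the sketch terminates.

```python
def wait_for_echo(echo_arrived, probe, max_probes):
    """Wait for an echo, sending an "OK?" query after each time-out.

    echo_arrived() -> bool : hypothetical check for the awaited echo
    probe() -> bool        : hypothetical "OK?" query; False means no valid reply
    Returns "echo" if the echo arrives, "failed" if a query goes unanswered,
    and "waiting" if max_probes time-outs pass with the peer still responding.
    """
    for _ in range(max_probes):
        if echo_arrived():
            return "echo"
        if not probe():
            return "failed"  # cannot communicate or no response: failure detected
    return "waiting"
```

A node that receives "failed" can then treat the peer as removed and carry out the reclamation described above instead of waiting forever.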

6.4.3. Discussion

Deadlock  detection  after  the  fact  is  in  some  ways  preferable  to  testing  every  request  for 
safety before granting that resource. Firstly, we expect that in most systems, deadlock occurs
infrequently [SHEM 74]. Secondly, prevention requires the entire graph to be searched for
each  request.  The  method  we  have  outlined  only  executes  the  procedure  for  a  subset  of 
requests,  namely  those  which  cause  processes  to  block. 

There are two other approaches taken to deadlock: prevention by pre-claiming [HABE
72], and avoidance by pre-ordering of resources [HAVE 68]. Some of these techniques have
also  been  advocated  for  distributed  systems  [CHU  74,  GRAY  74,  MENA  78].  The  method 
proposed  here  clearly  differs  from  most  of  these  in  being  a  general  model  of  process-resource 
interaction.  Thus,  no  hierarchy  of  resources,  no  pre-ordering,  etc.  is  required.  In  addition, 
as  LeLann  [LELA  78]  has  pointed  out,  mutual  exclusion  of  conflicting  processes  is  one  way 
of  avoiding  deadlocks.  In  some  systems,  deadlock  may  occur  more  frequently,  as  in  heavily 
used  database  systems  with  coarse  granularity  and  much  concurrent  updating.  Thus,  both 
detection  and  avoidance  algorithms  are  useful  to  study. 

Consider  now  the  termination  of  Detect.  This  is  not  straightforward,  because  a  single 
execution  of  Procedure  Detect  exists  across  all  the  nodes  of  a  P-R  graph.  In  a  knot,  yes  will 
be  found  by  the  initiator,  and  clearly  Detect  has  terminated.  It  can  then  clear  markings  from 
all  other  nodes.  However,  in  all  safe  P-R  graphs,  we  assume  that  state  changes  will  occur. 
Processes  release  and  request  more  resources,  etc.  This  can  occur  before  Detect  terminates 
at  its  initiator.  Any  node  that  undergoes  a  P-R  state  change  erases  all  Detect  markings,  for 
all executions in which it is involved. When this happens for all nodes, we can certainly consider
Detect to have terminated. We are also reassured by knowing that when such a node is
removed from consideration by Detect, then yes would not be answered in error.

If node D initiates Detect, then some node i in the same P-R graph has markings for this
invocation of Detect. Assume that the P-R graph is safe, and that node D undergoes P-R state
changes first, and subsequently becomes triggered again, before node i has had any P-R activity.
Then it is necessary for node i to know that the explorer from D is from a different execution
of Detect. This must be done with a sequence number, modulo M, some reasonably
large integer.
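
A minimal sketch of the sequence-number test follows. The value of `M`, the class `DetectMarkings`, and its method names are assumptions made for illustration; the text only requires that explorers from different executions by the same initiator be distinguishable.

```python
M = 2 ** 16  # assumed modulus; the text asks only for a reasonably large integer


def next_seq(seq):
    """Sequence number for the initiator's next execution of Detect, modulo M."""
    return (seq + 1) % M


class DetectMarkings:
    """Per-node record of which execution each initiator's markings belong to."""

    def __init__(self):
        self.current = {}  # initiator name -> sequence number of the marked execution

    def on_explorer(self, initiator, seq):
        """Return True if the explorer starts a new execution at this node.

        A True result means any old markings for this initiator are stale
        and are replaced; False means the explorer belongs to the execution
        already marked here.
        """
        if self.current.get(initiator) == seq:
            return False
        self.current[initiator] = seq
        return True
```

The modulus must be large enough that an old explorer cannot still be in transit when its number is reused; that condition is assumed here, not checked.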

We have not said what will happen when a node finds it is deadlocked. However, the
n-philosopher problem presented in Chapter 3 clearly points to using an election method to
appoint a single controller which will handle the particular deadlock at hand. For this,
techniques have already been described.

Chapter 7


Shortest  Path  Tree 


7.1.  Introduction 

In this chapter, we address the well-known problem of finding the shortest paths from
one node to all other nodes in a connected and weighted graph [DIJK 59, JOHN 77]. Assume
that in our graph model of a network of computers, as presented in Chapter 1, we add
a finite positive weight to each edge. Each weight on an edge (v,w) represents the cost of
going from v to w or vice versa. For a given initiator node, we would like to know the shortest
path in terms of cost from it to each of the other nodes. The cost of a path is simply
the sum of the costs of the edges making up the path. For a particular initiator node, then,
the shortest paths to all other nodes will form a tree, in which every node appears only once.
This is a spanning tree, which we call a shortest path tree, or SPT. It is not necessarily the
same tree as the minimum spanning tree for the graph, which is the spanning tree with a
minimum sum of edge weights over all such trees.

Dijkstra’s method [DIJK 59] for finding the shortest path tree from a single source is an
example of a sequential algorithm for solving this problem. It is based on two global operations:
set construction, and minima finding. As the algorithm proceeds, each node is labelled
with a weight, which represents the cost of reaching it from the source by the best
path chosen. The set ACCEPT is used to hold both chosen nodes and edges.

1.  Start  with  the  source  node.  Place  it  in  the  set  ACCEPT,  and  give  it  a  weight  of  zero. 

2.  Compute  the  costs  of  reaching  all  immediate  successors  of  the  nodes  in  ACCEPT.  The 
cost  of  reaching  a  node  w  from  a  node  v  is  the  weight  of  v  plus  the  weight  of  the  edge 
(v,w).  Take  the  minimum  of  these  destination  nodes,  and  place  it  with  its  computed 
weight,  and  the  edge  leading  to  it,  in  ACCEPT.  If  two  minima  are  found,  choose  one 
arbitrarily. 

3.  Now  repeat  (2)  until  all  nodes  have  been  placed  in  ACCEPT.  The  nodes  and  edges  in 
ACCEPT  will  make  up  the  SPT,  rooted  at  the  source  node. 
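
The three steps above can be sketched directly in Python. This is a literal transcription that rescans every edge leaving ACCEPT on each pass, so it does not reproduce the bookkeeping of an optimized implementation; the graph representation (a dictionary mapping each node to a dictionary of neighbour weights) and the function name are assumptions.

```python
import math


def dijkstra_spt(graph, source):
    """Build the SPT by repeatedly accepting the cheapest node outside ACCEPT.

    graph: dict node -> dict neighbour -> positive edge weight (undirected,
           so each edge is listed in both directions).
    Returns (weight, parent): weight[v] is the cost of reaching v from the
    source, and (parent[v], v) is the accepted edge leading to v.
    """
    weight = {source: 0}      # step 1: the source has weight zero
    parent = {source: None}
    accept = {source}
    while len(accept) < len(graph):
        best = (math.inf, None, None)
        for v in accept:      # step 2: cost of each successor of ACCEPT
            for w, cost in graph[v].items():
                if w not in accept and weight[v] + cost < best[0]:
                    best = (weight[v] + cost, v, w)
        cost, v, w = best
        if w is None:
            break             # remaining nodes are not connected to the source
        weight[w] = cost      # accept the minimum node and its edge
        parent[w] = v
        accept.add(w)
    return weight, parent
```

Ties are broken by scan order, which matches the instruction to choose arbitrarily when two minima are found.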

Figure 7.1 shows a connected weighted graph, and its shortest path tree. This method
has been shown to require O(n^2) comparisons.

We can see how difficult this particular algorithm would be to implement in a decentralized
way. Each major step requires that the minimum of the edges leading from the set of accepted
nodes to nodes not in the set be found and marked. It is easy to mark a node
accepted. It is not easy to find, at every step, the minimum edge of a new set without retracing
the same edges and nodes many times. The Bellman-Ford-Moore algorithm [DREY 69]
is one which lends itself more readily to a decentralized approach. Given a weighted graph
as described above, let $f_i^{(k)}$ represent the length of the shortest path that connects node 1 to
node i and that contains k + 1 or fewer arcs. Then the recursive formula is:

$$f_i^{(k+1)} = \min_j \left[ d_{ji} + f_j^{(k)} \right], \qquad f_i^{(0)} = d_{1i}$$
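
As an illustration of the recurrence, the sketch below iterates it synchronously until no value changes. It assumes that d[j][i] is the weight of the arc from j to i, that a missing arc has infinite length, and that d[i][i] = 0 so that "k + 1 or fewer arcs" is respected; the function name and data layout are hypothetical.

```python
import math


def bellman_ford_steps(d, source):
    """Iterate f_i^(k+1) = min_j [d_ji + f_j^(k)], starting from f_i^(0) = d_{source,i}.

    d: dict j -> dict i -> arc length; missing arcs are treated as infinite.
    Returns the dict of shortest path lengths from source to every node.
    """
    nodes = list(d)

    def arc(j, i):
        # Assumed convention: staying at a node costs nothing.
        return 0 if i == j else d[j].get(i, math.inf)

    f = {i: arc(source, i) for i in nodes}
    for _ in range(len(nodes) - 1):
        g = {i: min(arc(j, i) + f[j] for j in nodes) for i in nodes}
        if g == f:
            break  # no improvement in this step, so the values are final
        f = g
    return f
```

Each pass of the loop is one "step" in the sense discussed in the text: every node needs the previous values of all its neighbours before it can take its minimum.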

Figure 7.1 A connected weighted graph and its shortest path
tree, rooted at node (a).


However, there are also fundamental difficulties in using this as a decentralized algorithm.
An asynchronous system does not really support the concept of step, and therefore
finding the minimum at each step is a non-trivial task. We first introduce a decentralized
algorithm which knows nothing of steps, and accepts each improvement at a node as a final
one. A careful analysis of the number of message passes for a fully connected graph is
made. Then we introduce two mechanisms for keeping track of steps, allowing a minimum
at each node to be found, for each step. The first method applies best to a fully connected
graph, and gives essentially a decentralized Bellman-Ford-Moore algorithm. The second one
is more general, applying to arbitrary graphs, and is not analogous to any sequential shortest
path algorithm.

The  results  which  Friedman  obtained  [FRIE  78]  in  parallel  and  independent  work  on 
distributed  shortest  path  algorithms  are  also  based  on  message  passing.  Instead  of  using 
explorers  to  find  best  paths,  however,  he  relies  more  on  sending  edge  lengths  around  the 
graph. This results in algorithms which are often very complex. The decentralized
Bellman-Ford-Moore algorithm he presents uses order O(en^2) message passes, whereas our step
algorithms require O(ne) message passes, e being total edges and n total nodes. He does,
however, present other algorithms which are of O(en) message passes, also based on passing
edge information. We feel that our work, using explorers and echos, is complementary to
Friedman’s, and offers an alternative approach which is both simpler and more effective.

Finally,  we  note  that  we  are  addressing  the  shortest  path  problem  in  a  static  graph.  This 
is  not  the  routing  problem  in  computer  networks,  which  operates  in  a  changing  environment, 
and continually updates tables of current best paths at each node. For example, see Naylor
[NAYL 77]. We are instead studying a much simpler problem.

7.2.  Decentralized  Shortest  Path  Tree  Algorithm 

Given  an  undirected  and  connected  graph,  let  us  assume  a  positive  weight  on  each  edge. 
The  method  we  use  is  based  on  the  principle  of  immediate  gratification,  and  gives  rise  to  a 
selfish SPT algorithm. Each explorer carries the sum of the weights of the edges it has traversed.
Each node has a current best weight and best edge. Initially, each node has a best weight of
∞. We can think of an explorer with weight x arriving at node v from node w as offering
node  v  a  weight  of  x  to  be  connected  to  the  initiator  node  via  node  w.  If  this  is  better  than 
v’s  current  weight,  [v,w]  is  chosen  to  be  its  new  best  edge,  and  x  as  its  new  best  weight. 

At  this  point,  we  depart  from  our  established  practice  followed  in  other  echo  algorithms: 
that  subsequent  explorers  terminate.  For  if  an  explorer  offers  a  better  weight  to  v,  and  is  a 
subsequent  explorer,  then  v  has  already  sent  out  other  explorers,  with  a  worse  weight  than  is 
now  feasible.  Therefore,  as  it  gets  better  weights,  it  must  send  out  new  explorers  on  all  other 
edges,  to  offer  those  nodes  a  smaller  cost  path  to  connect  to  the  originator  node. 

At the end of all explorer activity, each node will have found its best edge for going
back to the initiator node. However, none of the nodes have identified the successor nodes
which they would reach in the SPT, in the outward direction from the initiator node. Thus,
we must design our algorithm such that the leaves of the SPT start connecting back to their
best nodes, and as this happens, each node will form a set of forward edges, each such edge
having an associated set of nodes reachable from it along the SPT. Finally, then, the initiator
node would know which are its forward edges, and what nodes are reachable on each of
them. To send a message to a particular node, it sends it on a specific forward edge, and
each successor node can in the same way choose the proper forward edge to pass the message
on toward its destination.

If the cost on each edge actually represented the time delay, then clearly a simple echo
algorithm traversal of the graph yields the shortest path tree, for the shortest paths must be
those on which the explorers first get to visit nodes. The forward edges can then be built
simply from echos coming back on P-edges, recalling from Chapter 4 that a P-edge is an
edge on which a primary explorer is carried, i.e., a node is visited for the first time. However,
we must address instead the general case in which we make no precise statements
about the speed of traversal of edges. As stated in Chapter 1, we will resort to the simplifying
assumption that each edge takes approximately the same time to traverse, but not
precisely so.

Algorithm  7.0  Selfish  SPT  Algorithm 

There  are  two  distinct  components  to  this  algorithm.  The  first,  and  by  far  the  simpler, 
deals  with  sending  out  new  explorers  if  a  node  finds  its  cost  to  connect  to  the  initiator 
improved  by  the  arrival  of  a  subsequent  explorer.  The  second  mechanism  deals  with  the 
knowledge  of  termination  of  the  algorithm,  and  of  the  problem  of  collecting  forward  edges 
for  each  node,  as  the  algorithm  proceeds. 

Let the graph be G, and the initiator node be S. Let the best weight for the initiator
node be zero, and ∞ for the rest, initially. We introduce some additional terminology. Consider
a node v. For every edge e of v, let it have a weight e.wt. Let v.wt refer to the current
best weight of node v. Let an explorer E arrive at node v carrying weight x, having travelled
on some path [p] from the initiator. This path [p] has as its last node v, and we call the
part of the path up to v [q]. We write this as [q] ∘ v → [p], where ∘ represents the extension of the
path [q] by the edge leading to the node v.

Let  us  deal  with  explorers  and  their  individual  terminations.  Consider  the  mechanism  for 
explorers: 

1.  The  initiator  sends  out  explorers  in  parallel.  Each  explorer  carries  the  weight  of  its 
edge  from  the  initiator. 

2.  If  an  explorer  with  weight  x  comes  to  node  v  and  x  is  less  than  v.wt,  an  improvement 
is  possible.  The  node  takes  its  new  best  edge  to  be  the  edge  of  arrival  of  the  explorer, 
and  x  to  be  its  new  v.wt,  and  new  explorers  are  sent  in  parallel  on  its  out-edges.  Each 
such  explorer  on  an  edge  e  gets  a  weight  of  x  +  e.wt. 

3. If v is a sink node, it updates v.wt, marks its only edge as its best edge, and then
creates an echo, terminating the explorer.

4. If the weight x on an explorer is greater than or equal to v.wt, the explorer also terminates
and an echo is formed at v.
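
The explorer rules above can be simulated sequentially. The sketch below is an illustration only: it models message delivery with a single FIFO queue (one of many possible arrival orders in an asynchronous system), ignores echos entirely, and uses a hypothetical graph layout and function name.

```python
import math
from collections import deque


def selfish_explorers(graph, initiator):
    """Run the explorer phase only and return (wt, best_edge, messages).

    graph: dict node -> dict neighbour -> positive edge weight (undirected).
    wt[v] is the best weight found at v; best_edge[v] is the neighbour from
    which the best explorer arrived; messages counts explorers delivered.
    """
    wt = {v: math.inf for v in graph}
    wt[initiator] = 0
    best_edge = {v: None for v in graph}
    # Rule 1: the initiator sends an explorer on each of its edges.
    queue = deque((initiator, w, cost) for w, cost in graph[initiator].items())
    messages = 0
    while queue:
        sender, v, x = queue.popleft()
        messages += 1
        if x < wt[v]:
            # Rule 2: improvement, so adopt the offer and re-explore other edges.
            wt[v] = x
            best_edge[v] = sender
            for w, cost in graph[v].items():
                if w != sender:
                    queue.append((v, w, x + cost))
            # Rule 3 is the special case in which v has no other edges.
        # Rule 4: otherwise the explorer terminates (the echo is not modelled).
    return wt, best_edge, messages
```

Because every edge weight is positive, each node's weight can only decrease a finite number of times, so the queue eventually empties whatever the arrival order.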

Now let us introduce the method of dealing with knowledge of termination, and of identifying
forward edges in the SPT. This will involve mainly the echo mechanism, although
explorer activity is affected to a minor degree. First of all, every explorer which sends out a
new set of explorers from v must get its own set of echos back, and must echo along the
proper edge. These explorers are differentiated by the different paths they took to get to v.
In other words, recalling that the path up to v was called [q], each such explorer has a
different [q]. When a particular explorer gets all its echos back, it sends the echo to the last
node in its [q] path. Therefore, each explorer carries its cumulative path, and each echo carries
the path yet to be taken to signal all the nodes visited by its explorer counterpart.

Let us call an explorer which causes more explorers to be sent out a q-explorer. To
know that all echos have come back for a particular q-explorer, we only need keep a bit-map
of all out-edges for each q-explorer, with its own edge of arrival marked ON. Each
arriving echo at an edge turns its appropriate bit ON, and when they have all arrived, the
entire bit-map is ON.
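
A minimal sketch of this bookkeeping, with a dictionary of booleans standing in for the bit-map; the class name and the use of neighbour names as edge identifiers are assumptions for illustration.

```python
class QExplorerRecord:
    """Tracks which out-edges have returned an echo for one q-explorer."""

    def __init__(self, out_edges, arrival_edge):
        # The edge of arrival is ON from the start; no echo is expected on it.
        self.bits = {edge: (edge == arrival_edge) for edge in out_edges}

    def echo_arrived(self, edge):
        """Turn the bit for this edge ON; return True once the whole map is ON."""
        self.bits[edge] = True
        return all(self.bits.values())
```

A node would keep one such record per q-explorer, keyed by the path [q] that the explorer carried.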

Now consider the explorers sent out by the arrival of a q-explorer at v. Assume that they
all go to nodes at which no improvement is possible, and echos will return. Each echo will
bring back the echo.SET of ∅, for clearly, no edge from v was a best path reaching any
node. When v gets this information back, it makes the assumption that there is no further
activity, and it sends the echo back for the predecessor of this q-explorer, with echo.SET of
{v}. Now its predecessor, call it u, getting an echo with echo.SET ≠ ∅, will label [u,v] as a
forward edge of u, with the reachset of that edge as the node set {v}.

If, however, one of the nodes reached from the successors of v should receive its own
q-explorer, which would then make an improvement to v, then v would send an explorer to u
in its turn. At node u, the edge e = [u,v] is then no longer the best way to get to v, so it cannot remain
as a forward edge. When an explorer arrives at a node, therefore, on edge e, it clears e as a
forward edge and clears the reachset of e.

Finally, an explorer from w which terminates at a sink node v creates an echo with
echo.SET of {v}, since v was reachable on a best edge from w. If the echos for a q-explorer
have several non-empty echo.SETs, then its own echo contains an echo.SET which is the
union of the others.

1.  An  explorer  arriving  at  node  v  on  edge  e  clears  any  forward  edge  marking  for  e,  and 
its  reachset.  Now  try  (2). 

2. An explorer terminating at a sink node v, and having travelled path p, will create the
following echo: echo.SET ← {v}, and echo.PATH ← [p]\v, where \ stands for the shortening
of path [p] by dropping its last node v.

3. Consider an echo arriving at node v on edge e carrying echo.SET of e.SET and
echo.PATH of e.path. Firstly, if e.SET ≠ ∅, then mark e as a forward edge of v, and
e.SET as the reachset of e. Go to (4).

4. Since e.path is some [q] ∘ v → [p], the echo returning to v was initiated by some
q-explorer from v. Thus, the edge e is checked off against the bit-map for q, and a set of
reachable nodes for q is built by q.SET ← q.SET ∪ e.SET. Of course, q.SET
represents all nodes which are reachable on best edges from v, if the particular
q-explorer travelled in the SPT. Go to (5).

5. If all edges for the q-explorer have come back, then send the echo with echo.SET ←
q.SET and echo.PATH ← [q]\v, to the last node in the new echo.PATH. Go to (6).

6. Of course, if v is the initiator node, S, then we are done.

At  the  end  of  the  algorithm,  each  node  has  a  set  of  forward  edges,  each  associated  with 
the  set  of  nodes  which  can  be  reached  on  that  edge.  In  addition,  each  node  also  has  a  best 
edge  which  will  take  it  back  to  the  initiator  node.  To  send  a  message  from  the  source  to  a 
node  x,  it  just  finds  the  forward  edge  containing  x,  and  this  routing  occurs  at  each  node  until 
x  is  reached.  Figure  7.2  shows  a  weighted  graph,  and  intermediate  and  final  stages  in  the 
finding  of  the  shortest  path  tree. 
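
The end state described above, forward edges with their reachsets, and the routing rule that uses them, can be sketched as follows. The sketch derives the forward edges centrally from the final best edges rather than from echos, which is an assumption made to keep it short; the function names are hypothetical.

```python
def forward_edges(best_edge):
    """Derive forward edges and reachsets from each node's final best edge.

    best_edge: dict node -> neighbour on its best edge toward the initiator
               (None for the initiator itself).
    Returns fwd: dict node -> dict successor -> set of nodes reachable
    through that successor along the SPT.
    """
    fwd = {v: {} for v in best_edge}
    for v in best_edge:
        # Walk from v back to the initiator, adding v to the reachset of
        # every forward edge crossed on the way.
        child, parent = v, best_edge[v]
        while parent is not None:
            fwd[parent].setdefault(child, set()).add(v)
            child, parent = parent, best_edge[parent]
    return fwd


def route(fwd, source, dest):
    """Follow, at each node, the forward edge whose reachset contains dest."""
    path = [source]
    node = source
    while node != dest:
        # Raises StopIteration if dest is not reachable from this node.
        node = next(nxt for nxt, reach in fwd[node].items() if dest in reach)
        path.append(node)
    return path
```

Each node consults only its own table, so the routing decision stays local, as the text requires.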

7.3.  Behaviour 

Let  us  consider  this  algorithm  using  the  following  measures:  elapsed  time,  storage  at  a 
node,  and  number  of  messages  in  total.  First  of  all,  to  simplify  analysis,  let  us  assume  that 
the  speeds  of  explorers  and  echos,  though  not  deterministic,  are  approximately  the  same. 
Then,  by  examining  Figure  7.3,  we  see  that  the  algorithm  is  exploring  the  tree  of  all  possible 
paths in parallel. Define a distinct path to be a path rooted at the source, which contains no
cycles. If we assume that processing and queueing costs have been considered in edge traversal
estimates, then the elapsed time for the forward phase is only that of the
longest distinct path in the graph.

Another  way  of  looking  at  this  result  is  that  of  conservation  of  work  done  at  each  path. 
Given  a  node  with  some  weight  at  an  intermediate  point,  if  a  better  weight  is  offered  by  an 
explorer  coming  later,  and  therefore  involving  more  nodes,  then  the  sub-trees  which  must  be 
explored  following  this  improvement  cannot  contain  any  of  the  nodes  already  in  the  path  of 
the  explorer.  Conversely,  if  a  better  weight  comes  early  to  a  node,  the  number  of  nodes  yet 
to  be  explored  is  larger,  but  only  a  few  nodes  have  been  traversed  to  date. 

The amount of storage at each node is dominated by the path name that must be kept
for each explorer that makes an improvement. A path name is at most n nodes long, and
thus takes n log n bits. There are at most n such explorers, and therefore n^2 log n bits will
be needed at each node.

Finally, consider the number of message passes needed. Certainly, the tree of all possible
paths contains n! edges for a fully connected graph. However, we can see from Figure
7.4 that many possible sub-trees are not actually traversed. In this Figure, we have assumed
that explorers travel at approximately the same speed. Thus, each level of the tree occurs
before any activity in a subsequent level. Certainly, if the system is totally unpredictable in
terms of time for each edge traversal, we cannot know what sub-trees are eliminated. However,
even if we assume approximately equal speeds, the problem is still very difficult.
Observe that paths are symmetrical in that if a longest cycle a.b.c...y.z.a exists, then an
inverse a.z.y...b.a cycle is also present. Only one of these can complete, and this is true for
all cycles. We now turn to the study of a worst case analysis for the shortest path algorithm.

7.4.  Worst  Case  Analysis 

We  argue  that  a  fully  connected  graph  of  n  nodes  gives  us  the  worst  case  for  number  of 
messages,  as  each  node  has  the  maximum  number  of  choices  to  make,  which  is  n,  and  each 
node may have to send n - 1 messages as many times as improvements can occur on new
explorers coming from n edges. Consider the all-possible-paths tree from an initiator node in
this  graph.  There  are  n  levels,  with  a  multiplying  factor  of  n  at  each  level.  Thus,  the 
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(3) 


showing  each  node  having  chosen 
its  best  edge  to  connect  to  the 
detector  node  (a) ,  and  the  weights 
for  connecting,  at  each  node. 


the  shortest  path  tree,  rooted 
at  node  (a) .  Each  edge  of  the 
tree  holds  the  set  of  nodes 
which  are  reachable  along  the 
tree  from  that  edge. 


Figure  7.2  A  weighted  graph,  an  intermediate  stage  and  the  final 

stage  in  finding  the  shortest  path  tree  rooted  at  node  a. 


Figure  7.3  A  weighted  C-graph  and 
its  all-possible  paths  tree 


a 


Figure  7.4  The  tree  of  all  possible  paths,  rooted  at  node  a 
Sub-trees  which  are  not  traversed  are  circled  in 
by  a  dotted  line. 


number  of  edges  in  the  tree  is  of  the  order  ( nn ).  This  is  clearly  exponential.  However,  we 
shall  try  to  define  the  actual  bound  more  precisely. 

Figure 7.5 is an illustration of a partial execution of a selfish SPT algorithm for a fully
connected graph of n = 6. We use it to make some important observations and assumptions.
First, the algorithm can be thought of as proceeding in steps. In the first step, explorers are
sent from the initiator to all other nodes. In the second step, each node sends explorers to
neighbours it has not ever received from, and so on. We assume that each step completes
before the next begins, so that all the nodes receive messages sent to them in one step before
they start the next step. Each explorer maintains the path it has taken from the initiator, so
that at a node i, if it should make an improvement, it will send messages to all nodes except
itself and those already in its path. This guarantees that no circular paths are explored.

We  do  not,  however,  require  that  explorers  be  uniform  in  speed.  Thus,  at  any  step,  a 
node  may  receive  many  explorers,  but  they  may  arrive  in  any  order.  We  assume  that  they 
are  processed  as  they  arrive,  instantaneously,  and  furthermore,  that  the  worst  case  happens, 
so that the best message arrives last. This means that at any step as few explorers are
extinguished as possible. We worry only about the explorer phase, letting explorers terminate
when they come to a node with a smaller weight.

In  the  best  case,  all  shortest  paths  from  the  initiator  node,  which  we  shall  call  a  from 
now  on,  can  be  found  in  one  step.  But  at  worst,  only  n  steps  are  needed.  Furthermore,  at 
every  step,  at  least  one  node  must  have  found  its  best  weight,  in  the  sense  that  all  future 
explorers  arriving  there  will  terminate. 

Now that we have defined the rules for the operation of the selfish algorithm for a fully
connected graph, let us consider how we may count the messages it generates. Our technique
is to derive an expression for the number of realizable leaves at each level of the execution
tree, if the execution were to be stopped at that level. Each such leaf corresponds to a
message generated at the level before. We then sum this expression over all n - 1 levels, to
get the total number of edges in the entire execution tree.

To  understand  the  concept  of  a  realizable  leaf,  we  introduce  the  following  Lemma. 

Lemma  7.1.  Consider  explorers  on  the  paths  abc  and  acb.  One  of  them  is  logically  excluded 
from  continuing. 

Argument: For brevity, we represent the weights on the edges by the edge names. Then node
b gets two explorers, with weights ab and ac+bc, while c gets explorers with weights ac and
ab+bc. Figure 7.6 illustrates this case. At node b, if ab > ac+bc, then explorer acb goes
on. However, this certainly means that, substituting for ab at node c, explorer abc has a
weight > ac+bc+bc, which is certainly greater than ac, as long as all edge weights are
positive. Hence explorer abc must stop. The reverse argument is symmetric.

We can call this the SPT-triangular inequality. Note that it holds regardless of how
many nodes might lie on the edges ab, ac, bc. The essential feature is that a common edge is
traversed in the opposite direction by two explorers, leading to a situation in which only one
explorer can continue. □
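
The lemma can be checked numerically for a single triangle. The sketch below is an illustration, not part of the argument; the function names, the weight range, and the number of trials are arbitrary assumptions.

```python
import random


def continuing(ab, ac, bc):
    """Return which of the explorers abc and acb can go on.

    Explorer abc offers node c the weight ab + bc and goes on only if that
    beats the direct offer ac; explorer acb is treated symmetrically at b.
    """
    result = set()
    if ab + bc < ac:
        result.add("abc")
    if ac + bc < ab:
        result.add("acb")
    return result


def check_lemma(trials=1000, seed=0):
    """True if no random positive-weight triangle lets both explorers go on."""
    rng = random.Random(seed)
    for _ in range(trials):
        ab, ac, bc = (rng.uniform(0.1, 10.0) for _ in range(3))
        if len(continuing(ab, ac, bc)) > 1:
            return False
    return True
```

With weights ab = 5, ac = 1, bc = 1 only acb continues; with all three weights equal, neither does, which shows that the lemma excludes at least one explorer rather than exactly one.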

Figure 7.6 The SPT-triangular inequality

Now study Figure 7.5 again. At level 1, there are (n-1) possible leaves. At level 2, there
are (n-1)(n-2) possible leaves. At level k, there are (n-1)(n-2)...(n-k) possible leaves,
where k > 1. Consider abc and acb in terms of the SPT-triangular inequality. If abc goes
on, then acb is non-viable, and will not produce any explorers in the next step. For Figure
7.5, its descendants in the next step, to d, e, f, will not occur. These are the non-realizable
nodes in level 3. For a particular level, the realizable nodes are those descended from the
viable nodes of the level before. To compute the realizable nodes of level k, observe that the
viable nodes of level k-1 are a specific proportion of the possible nodes at that level.


Proposition 7.1. At level i of an execution of the selfish SPT algorithm as described above on a fully connected graph, a path from the initiator node is of length i. Then the viable nodes at level i are exactly 1/i! of all the possible nodes at that level.

Argument: Any path of length i from the initiator node a involves exactly i other nodes besides a. Given all possible paths, there are i! permutations of these nodes, also rooted in a, which are paths of length i in the execution tree. Thus, if one path is abcd, others will include abdc, acdb and acbd. But whichever is assumed viable, and we assume without loss of generality that abcd is, then all the others are not. Given two distinct sequences, one of which is a permutation of the other, there will be at least one pair of elements which are reversed in order in the two sequences. Since the permutations of our paths share a common initial node, the existence of the reverse order of at least one pair of nodes implies that any pair of such paths must obey the SPT-triangular inequality. Hence in the i! permutations, there will be only one viable path. It follows, therefore, that the number of viable nodes at level i is one for each permutation set, i.e., 1/i! of all nodes at that level. □


Now we can compute the realizable nodes at level k. The possible nodes at level k - 1 number (n - 1)(n - 2)...(n - k + 1). Since (k - 1)! of them are permutations, the number of viable nodes at level k - 1 is

\[ \frac{(n-1)(n-2)\cdots(n-k+1)}{(k-1)!} \]

Each viable node at level k - 1 will give rise to n - k descendants. Therefore, the number of realizable nodes at level k is

\[ \frac{(n-1)(n-2)\cdots(n-k+1)(n-k)}{(k-1)!} \]

Each realizable node represents a traversed edge. Thus, to get the total number of edges in the execution, we sum this expression over the n - 1 levels. We observe that

\[ k\binom{n-1}{k} = \frac{k\,(n-1)!}{k!\,(n-k-1)!} = \frac{(n-1)(n-2)\cdots(n-k)}{(k-1)!} \]

Since we know that

\[ (1+x)^{n-1} = \sum_{k=0}^{n-1} \binom{n-1}{k} x^{k} \]

\[ \frac{d}{dx}(1+x)^{n-1} = \sum_{k=1}^{n-1} k\binom{n-1}{k} x^{k-1} = (n-1)(1+x)^{n-2} \]

Evaluating at x = 1, we get

\[ \sum_{k=1}^{n-1} k\binom{n-1}{k} = (n-1)2^{n-2} \]


This result, (n - 1)2^(n-2), is a precise upper bound on the number of messages for the selfish SPT algorithm executing, as defined above, on a fully connected graph. This is still an exponential result, however. We could make a significant improvement if, at each level, each node were restricted to sending out messages only once.
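The closed form can be checked against a direct summation of the per-level counts. The sketch below is only a numerical sanity check under the counting model of this section; the function names are hypothetical.

```python
from math import factorial, prod


def realizable_nodes(n, k):
    """(n-1)(n-2)...(n-k) / (k-1)!, the realizable nodes at level k."""
    # range(n - k, n) yields n-k, n-k+1, ..., n-1, i.e. the k factors.
    return prod(range(n - k, n)) // factorial(k - 1)


def total_edges(n):
    """Sum the realizable nodes over the n - 1 levels of the execution tree."""
    return sum(realizable_nodes(n, k) for k in range(1, n))


def closed_form(n):
    """The bound (n - 1) * 2^(n - 2) derived in the text."""
    return (n - 1) * 2 ** (n - 2)
```

For n = 4 the levels contribute 3, 6 and 3 edges, and both functions return 12.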

7.4.1.  Decentralized  Clock  Scheme 

The fully connected graph SPT algorithm can be improved by requiring each node to wait until all the messages due it at a particular level arrive. Permitting only one of the explorers to survive will pare the execution tree much further. At level zero, the source node broadcasts n - 1 messages. At level 1, each of n - 1 nodes broadcasts n - 2 messages, but one of them will have been connected via its best path to the source. Therefore it will never send any further messages. Thus, at level 2, n - 2 nodes will broadcast n - 3 messages. At level k, n - k nodes each send out n - k - 1 messages, for a total of (n - k)(n - k - 1) messages. Clearly, the total number of messages for such an execution will be O(n^3), since there are only n - 1 levels.
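The cubic bound can be made concrete by summing the per-level counts just described. This is a small sketch with hypothetical function names; the closed form is not stated in the text and follows from summing m(m - 1) for m = 1, ..., n - 1.

```python
def clocked_messages(n):
    """Explorer messages in the clocked scheme on a fully connected graph."""
    total = n - 1                       # level 0: the source broadcasts
    for k in range(1, n):               # levels 1 .. n-1
        total += (n - k) * (n - k - 1)  # n-k nodes, n-k-1 messages each
    return total


def clocked_messages_closed_form(n):
    """(n - 1) + n(n - 1)(n - 2)/3, obtained by summing m(m - 1)."""
    return (n - 1) + n * (n - 1) * (n - 2) // 3
```

For n = 4 this gives 3 + 6 + 2 + 0 = 11 messages, well below n^3 = 64.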

To effect this improvement, each node must wait for all its explorers for a given level to arrive. Unfortunately, there is no simple pattern which will allow each node to know exactly how many to expect at each level. Thus, we impose a simple clock scheme, centred on S. Each step is controlled by S sending a clock pulse. Thus S, being the initiator, starts, sends explorers to the other nodes, and waits for echos, which are simply acknowledgements of activity arrival. When all echos have returned, it sends a clock pulse, which is simply an activation message, to the other nodes. They now send explorers according to the rules described above, wait for their echos, then themselves echo to S. Another clock pulse is now sent. This triggers any further explorers for the next level of activity, and so on. Therefore, n - 1 levels of activity require that many clock pulses.

If each edge took about one unit of time to traverse, then the elapsed time needed to execute this algorithm is n clock cycles. Each of these requires one clock pulse, one explorer activation and two echos. Thus, the elapsed time is approximately 4n, given our assumptions of negligible processing and queueing times.

Each clock cycle requires echos of messages to all nodes before they echo the initiator. Thus, a bound of n^2 echos is needed. For the algorithm, then, the total number of messages is still O(n^3).

7.5.  The  Clocked  General  Graph  Case 

In  the  case  of  the  general  graph,  the  decentralized  clock  concept  is  also  applicable.  Let 
each  node  keep,  in  addition  to  the  current  weight  and  current  best  edge,  a  weight  associated 
with  each  edge.  Then  Step  0  is  an  execution  of  a  pure  traversal  algorithm,  except  for  two 
things.  The  first  requirement  is  that  each  explorer  carries  its  cumulative  path  weight,  which 
is recorded at its destination node. Thus, each node eventually gets a weight at each edge, representing the cost of connecting to the initiator S through that edge.

The  second  is  that  we  want  to  record  the  spanning  tree  induced  by  this  first  traversal. 
Recall  from  Chapter  4  that  this  is  called  the  P-tree,  and  each  of  its  nodes  is  a  P-node  in  that 
it  has  one  or  more  edges  on  which  it  has  sent  out  explorers  which  are  first  at  their  successor 
nodes. We identify echos from P-nodes as P-echos. Therefore, nodes which are the leaves of
the P-tree get only non-P echos. In turn, each of them sends a P-echo on its first edge.
Thus,  at  the  end  of  the  traversal,  each  node  will  know  what  its  first  edge  is,  and  which  of  its 
other  edges  are  P-edges.  This  spanning  tree  will  be  useful. 

When  the  initiator  S  receives  all  its  echos  from  this  traversal,  then  it  starts  Step  1.  A 
clock  pulse  (another  type  of  explorer)  is  sent  in  parallel  on  the  P-tree  which  was  built.  Each 
node, on receipt of a clock pulse, finds from all its edges the one with the best weight. Following Step 0, the current best weight at a node is the weight carried by its first explorer,
and  its  current  best  edge  is  initially  its  first  edge.  If  the  new  best  edge  is  different  from  its 
current  best  edge,  then  it  updates  its  current  best  edge  and  its  current  best  weight,  and  sends 
explorers  with  updated  weights  on  all  edges  except  its  new  current  best  edge. 

Each such explorer arrival at its successor node triggers a receipt. A node must wait for all its receipts to arrive. If the node is a leaf of the P-tree, it can now send a change clock echo back on its first edge. If it is an internal node of the P-tree, it must wait not only for all its receipts from its neighbours, but also for clock echos from its immediate descendants in the P-tree. A node which had a change in best edge or receives a change clock echo itself creates a change clock echo; otherwise it sends back a static clock echo. Thus, when the initiator node S receives only static clock echos, no improvement occurred in the last step. The algorithm is done.

If  this  is  not  so,  then  the  next  step  is  the  sending  of  another  clock  pulse  down  the  P-tree. 
This  mechanism,  then,  allows  us  to  detect  the  end  of  all  activity  at  each  Step,  as  well  as  the 
end  of  all  activity. 
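The step structure can be illustrated with a centralized, round-by-round simulation. This is only a sketch under simplifying assumptions: every explorer sent in a step is delivered before the next clock pulse, Step 0 is collapsed into the first round, and the P-tree, receipts and clock echos are not modelled (an empty set of senders stands in for "only static clock echos"). In this simplified form it amounts to a synchronous relaxation of the Bellman-Ford kind. The graph representation and all names are hypothetical.

```python
import math


def clocked_spt(graph, source):
    """Simulate clock steps; return (best weight, best edge, steps used).

    graph maps each node to a dict {neighbour: positive edge weight}.
    """
    best = {v: math.inf for v in graph}
    best_edge = {v: None for v in graph}
    edge_weight = {v: {} for v in graph}  # per-edge weights kept at each node
    best[source] = 0.0
    senders = [source]                    # nodes whose best weight just changed
    steps = 0
    while senders:
        steps += 1
        # Explorers carry the sender's weight plus the weight of the edge used.
        for u in senders:
            for v, w in graph[u].items():
                if v != best_edge[u]:     # never send on the current best edge
                    edge_weight[v][u] = best[u] + w
        # On the clock pulse each node picks the best of its recorded weights.
        senders = []
        for v in graph:
            if v == source or not edge_weight[v]:
                continue
            u = min(edge_weight[v], key=edge_weight[v].get)
            if edge_weight[v][u] < best[v]:
                best[v], best_edge[v] = edge_weight[v][u], u
                senders.append(v)
    return best, best_edge, steps
```

On a four-node example with edges S-a = 4, S-b = 1, a-b = 2, a-c = 1 and b-c = 5, the simulation settles on weights a = 3, b = 1, c = 4 and stops after four steps, the last of which produces no change.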

We have shown that the longest path in the all-possible-paths tree is n, the number of nodes in the graph. The clocked method is clearly just the selfish SPT algorithm with a built-in delay at each node, so that the best choice can be made at each step. Thus, it is executing the all-possible-paths tree, except that it is economizing on messages. However, the maximum number of steps is n. At each step, each node can only pass to, in the worst case of a fully connected graph, n - 1 neighbours. Thus the total number of messages regarding weights is bounded by n^3. The messages required for clock pulses are 2n each cycle, and there are at most n cycles. Hence, total messages remain bounded by n^3. Clearly, the time required to execute the algorithm is bounded by n steps, each requiring 2D time, where D is the diameter of the graph. Thus, total time is bounded by 2nD.

7.6.  Discussion 

Finding the shortest path tree in a connected, undirected and weighted graph by a totally decentralized method involves some degree of difficulty. The straightforward selfish algorithm required a number of messages which is an exponential function of n, the number of nodes. In fact, it is bounded in the worst case of a fully connected graph by 2^n message passes. We have seen that introducing the device of a decentralized clock pulse allowed us to reduce the number of message passes needed to be bounded by n^3, for both the fully connected as well as the general graph. In this last problem, we found the spanning tree induced by an initial parallel traversal of the graph to be very useful.

Although the time required for the clocked algorithms is n clock cycles, instead of simply one traversal for the selfish SPT algorithm, this comparison is somewhat unfair. The selfish SPT algorithm assumes that messages can be passed instantaneously without cost, so that passing 2^n messages in D steps only takes D time, where D is the diameter of the graph. In fact, this may not always be the case. Nevertheless, the differences in message passes and times between the two algorithms serve to illustrate that what one gains in savings on message passes may be lost on the additional time required. Finally, without having to implement the entire clock mechanism, a finite delay at each node will allow several explorers to arrive, so that some optimization becomes possible, also at the cost of time. The exact characterization of this message and time trade-off we leave to further research.

In  terms  of  failures  of  nodes  and  links,  it  is  clear  that  such  failures  alter  the  graph,  and 
thus change the shortest path tree. A restart of the algorithm is therefore mandatory, following ABORT of the algorithm, and a re-organization of the adjacency data for each node.

Chapter 8


Conclusion 


In this thesis, we have presented an environment in which networks of computers cooperate without the need for central control. Given such an environment, the need to obtain global information about the network from time to time motivated the creation of algorithms which, though decentralized in control, nevertheless achieve a single system-wide end. These algorithms are based on the technique of parallel traversal of a general graph such that a minimum edge covering is obtained. As a class, they are called echo algorithms.

Previous  work  has  been  focused  on  specific  problems  in  distributed  environments,  or  has 
used  more  restrictive  models  of  multiprocessing  systems.  In  this  work,  we  have  introduced  a 
general  technique,  and  a  model  of  computation  for  a  loosely  coupled  system  of  machines, 
which  has  many  desirable  properties. 

Echo  algorithms  function  in  parallel,  and  take  execution  time  for  their  communications 
of  the  order  of  the  diameter  of  the  graph  modelling  the  network  of  machines.  The  total 
elapsed  time  depends  of  course  also  on  the  computing  power  of  each  of  the  interacting 
machines. Though echo algorithms operate strictly on message passing, the number of message passes in general is bounded by 4e, where e is the number of edges in the graph. The storage required at a node, for most algorithms, is small and bounded by n^2 bits, where n is the number of nodes in the graph.

Some  important  functions  which  echo  algorithms  can  serve  in  a  network  are:  providing  a 
mutual  exclusion  mechanism  where  it  is  important  for  only  one  process  at  a  time  to  proceed, 
such  as  in  the  multiple  file  copy  update  problem,  or  finding  those  components  of  the  graph 
which are connected by a single node to the rest of the graph. Echo algorithms can be applied to the problem of finding deadlock in distributed systems, and a modified echo algorithm can be used to find the shortest paths from one node to all other nodes. In general,
we  observe  that  it  is  a  simple  and  versatile  technique. 

8.1.  Reliability 

An  important  area  which  must  be  considered  for  echo  algorithms  is  that  of  reliability. 
Since  they  are  to  operate  in  distributed  systems  which  are  susceptible  to  failures,  we  must 
ask  how  echo  algorithms  behave  in  the  face  of  failures.  We  have  studied  this  problem  in  the 
discussion  of  specific  algorithms.  Although  some  of  them  have  been  shown  to  be  robust,  for 
example, the fail-safe version of the extinction algorithm in a circular configuration, in general, echo algorithms are highly dependent for their proper termination on the functioning of
all  nodes  involved. 

For example, if a node were to fail while the biconnected components algorithm was executing, and an echo got lost, the initiator would never get all its echos back. In asynchronous systems, the only way to distinguish between waiting for an echo and waiting on a failed node for an echo is a mechanism which detects the failure of neighbouring nodes.

Even if this situation were to be found, this particular execution of the biconnected components algorithm could not proceed until the node was restored. In this case, and in many other algorithms we have considered, the proper thing to do is to abort the current execution of the algorithm, and cause it to be restarted.

There are three possible types of failures. The first is the failure of a node before the activity from an echo algorithm reaches it. In many cases, we can safely treat this situation as if the node is not in the graph, since its neighbours cannot send to it, and therefore, some algorithms are not adversely affected. The second is the failure of a node during the execution of an echo algorithm, after it has been visited by an explorer. The node waiting for its echo must detect this failure either through a time-out, followed by a query to the failed node, or else through a reassurance mechanism, whereby a node sends an "OK" message to its predecessor periodically. The cessation of this message will cause the waiting node to infer that a node failure has occurred. When this failure has been detected, an ABORT mechanism, such as the one described in Chapter 4, must be invoked. Finally, there is the situation in which a node fails following the execution of an echo algorithm. The graph is no longer the same, and the result of the echo algorithm may no longer be valid. However, there is no way of letting the initiator know this. This last problem, however, is common to any algorithm operating in a dynamic network environment, and not unique to echo algorithms.
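The reassurance mechanism for the second type of failure can be sketched as follows. This is an illustrative sketch only: the class, its method names and the explicit passing of the current time are assumptions made here for the sake of a deterministic example, and are not taken from the thesis or from the ABORT mechanism of Chapter 4.

```python
class ReassuranceMonitor:
    """Kept by a node waiting for an echo from one successor.

    The successor is expected to send an "OK" message periodically until
    its echo is ready. Times are plain numbers supplied by the caller.
    """

    def __init__(self, timeout, now):
        self.timeout = timeout      # longest tolerated silence
        self.last_ok = now          # time of the most recent "OK"
        self.echo_received = False

    def on_ok(self, now):
        """Record a reassurance message from the successor."""
        self.last_ok = now

    def on_echo(self):
        """The awaited echo arrived; no further reassurance is needed."""
        self.echo_received = True

    def should_abort(self, now):
        """True when the "OK" messages have ceased for longer than the timeout."""
        return not self.echo_received and (now - self.last_ok) > self.timeout
```

A waiting node would poll `should_abort` and, when it returns true, invoke whatever abort procedure the system provides.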

8.2.  Further  Research 

The problems of decentralized control have by no means all been solved in this thesis. We have presented a model of a distributed system, and studied one approach to decentralized control in this model. There are many areas which deserve further research. One of them is the careful elaboration of the relationship between elapsed time, processor power, and queueing for echo algorithms in general. In particular, it would be useful to characterize decentralized algorithms in terms of total elapsed time, given the specifications for processors and communications links.

Further work also is needed in the areas of reliability and robustness for specific echo algorithms, particularly in the question of detecting message loss, and handling node recovery and re-entry into the network. Some quantitative solutions would be desirable, for if we accept the inevitability of failures, then it would be good to know how often algorithms must be aborted, and how much wasted work this introduces into the system as a whole.

Echo algorithms are based on the point of view that each node operates on the same program, and is not a priori different from any other node. No formal model exists to describe this environment, or to lend proof techniques which will show that given the behaviour of a part, and the relationship between the parts, we can prove the correct behaviour of the whole system. In this thesis, we have only used plausibility arguments. The best approach to date has been that of Dijkstra [DIJK 79a, 79b], which is based on proposing certain invariants for every node, and then showing that a program which holds these invariants true cannot be other than correct. This is, however, not so much a proof technique as a method of program construction. Furthermore, we also lack the tools which would allow us to derive invariants from a specification of the problem. Thus, the choosing of invariants is still a subtle and difficult problem to solve.

We have made a conjecture that the average number of message passes for electing the largest numbered node in an arbitrary graph is n log n. This topic is one which can benefit from additional work. For example, one approach might be to examine the change in message passes required as one adds an edge to a graph, or a node and an edge. An inductive solution for some interesting classes of graphs might be found in this way.

Further  research  will  also  be  of  interest  in  studying  the  application  of  echo  algorithms  to 
other  graph  problems.  For  example,  the  entire  class  of  problems  which  are  in  NP,  such  as 
finding a Hamiltonian circuit, have yet to be investigated in a distributed environment.

We  have  implemented  some  of  the  algorithms  described  in  this  thesis,  on  THOTH,  an 
operating  system  supporting  processes  which  communicate  through  message  passing  [CHER 
79]. This was done more as a demonstration than as a simulation. Some typical examples of the execution and interleaving of activity are included in Appendix IV.

In  summary,  in  this  thesis  we  have  presented  two  techniques  which  are  fundamental  to 
decentralized  control  algorithms  for  distributed  systems.  The  first  is  that  of  parallel  traversal 
of  a  graph,  which  is  efficient,  fast,  and  versatile.  The  second  is  that  of  election  by  extinction, 
which finds application in providing a decentralized mutual exclusion mechanism. In a distributed system with no central control, such a mechanism is shown to have important use. Through studying the application of these techniques to different problems which require some global result through the coordinated activity of all nodes, we believe that we have also gained a deeper understanding of the structure and behaviour of decentralized algorithms.

Appendix I


Derivation of n log n result for circular configurations


Our model is that n nodes are in a circle. We write C(a,b) as the number of ways of choosing b things from a things. In Chapter 3, we have shown that the expected number of message passes, the random variable X, is:

Appendix II


Derivation  of  n  log  n  results  for  non-circular  graphs 


Star  Graph 

Figure A.1 shows a star graph of n nodes named from 1 to n. The rank of each node is taken from its name. We first assume a step-wise model of computation, in which a node fires by sending its message(s), and then another node fires. Thus, messages are sufficiently fast so that between any two firings, all messages are delivered. We modify this subsequently to include simultaneity.

The  algorithm  we  consider  is  an  extinction  algorithm.  Let  the  center  node  be  called  C.  It 
has  a  superior  field,  which  we  abbreviate  as  C.sup.  If  v,  a  peripheral  node,  sends  a  message 
to C, and v is larger than C.sup, v becomes the new value of C.sup, and more explorers are
sent  from  the  center.  If  v  is  smaller,  then  its  message  is  extinguished. 

If the center is not the first node to start, then any message coming to the center will cause it to start. Once it starts, and sends e messages, all nodes which are smaller than it will be suppressed. Only the nodes larger than the center will be triggered. Each of them arriving at the center, if larger than the current superior, will cause e messages to be sent. Therefore the problem is only to determine the average order in which these nodes will be triggered. If the order is strictly ascending, then all nodes larger than the center will cause e messages to be sent. On the other hand, if the largest is triggered first, then it will suppress all other activity. This problem is exactly that of the average number of maxima among a set of i things, which is known to be H_i, the i-th harmonic number [KNUT 73]. We can formulate this more precisely.
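The harmonic-number fact used here can be verified exhaustively for small i. The sketch below assumes that a "maximum" means an element larger than every element triggered before it, and that all i! triggering orders are equally likely; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def harmonic(i):
    """H_i = 1 + 1/2 + ... + 1/i, as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, i + 1))


def count_maxima(order):
    """Number of elements larger than everything triggered before them."""
    best, count = None, 0
    for v in order:
        if best is None or v > best:
            best, count = v, count + 1
    return count


def average_maxima(i):
    """Average of count_maxima over all i! equally likely orders."""
    orders = list(permutations(range(i)))
    return Fraction(sum(count_maxima(p) for p in orders), len(orders))
```

For i = 2 the two orders give 2 and 1 maxima, an average of 3/2 = H_2.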

Let node n - i + 1 be in the center. Sometimes it, or a node larger than it, may be the first to fire, in which case we are into the consideration of the average number of maxima already. However, if a smaller one fires, then one message suffices to trigger the center node to fire. Since there are i - 1 of these smaller nodes, and one of n nodes may be the first to fire, the expected number of times a smaller node starts first is (i - 1)/n. There are i nodes which are equal to or greater than n - i + 1. Each of these will trigger e messages. The average number of these which do so is H_i, the i-th harmonic number. Let the random variable X be the number of message passes. Then, the expected value of X for node n - i + 1 is

\[ E_i(X) = \frac{i-1}{n} + eH_i \]

Then

\[ E(X) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{i-1}{n} + eH_i\right) = \frac{1}{n^2}\sum_{i=0}^{n-1} i + \frac{e}{n}\sum_{i=1}^{n} H_i = \frac{n-1}{2n} + \frac{e}{n}\left[(n+1)H_n - n\right] \]

Figure A.1 A star graph of n nodes with node n - i + 1 at the center

Since e = n - 1, this is

\[ E(X) = nH_n - n + 1 - \frac{H_n}{n} + \frac{n-1}{2n} \approx nH_n \sim n\log n \]
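The averaging step can be checked with exact arithmetic. This sketch takes the per-center expectation (i - 1)/n + e H_i from the text with e = n - 1, averages it over the n equally likely centers, and compares the result with a closed form obtained from the standard identity H_1 + ... + H_n = (n + 1)H_n - n. The function names are hypothetical.

```python
from fractions import Fraction


def harmonic(i):
    """H_i = 1 + 1/2 + ... + 1/i, as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, i + 1))


def expected_by_sum(n):
    """Average of (i - 1)/n + e * H_i over the n possible center nodes."""
    e = n - 1
    total = sum(Fraction(i - 1, n) + e * harmonic(i) for i in range(1, n + 1))
    return total / n


def expected_closed_form(n):
    """(n - 1)/(2n) + (e/n) * ((n + 1) * H_n - n) with e = n - 1."""
    e = n - 1
    return Fraction(n - 1, 2 * n) + Fraction(e, n) * ((n + 1) * harmonic(n) - n)
```

The two agree exactly for every n tried, and the dominant term is n H_n, which grows as n log n.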


We now include simultaneity. The effect of simultaneous startup will not affect the average number of maxima for any i, but only the number of smaller nodes which might start before messages from the center suppress them. This is only i - 1 at most for any case with n - i + 1 in the center. Therefore, with simultaneity, we can expect the number of messages to be bounded by including this term. This contributes at most

\[ \frac{1}{n}\sum_{i=1}^{n}(i-1) = \frac{n-1}{2} \]

Therefore, with simultaneity, the expected number of messages for a star graph is bounded above by n log n + n/2.


Fully  Connected  Graph 

Again, we first assume a stepwise extinction algorithm. Whichever node starts first sends out e messages. If the largest starts first, only e messages need be sent. In the worst case, the nodes start in ascending sequence, and each node sends out e messages. The problem is again the average number of maxima among n things, which is H_n, each of which sends e messages.

In a fully connected graph, the number of edges at each node is n - 1. Therefore, the expected number of messages is (n - 1)H_n, which is essentially of order n log n.

Even  if  all  nodes  started  simultaneously,  for  the  fully  connected  graph  only  the  order  in 
which  each  node  ends  up  firing  is  important.  Thus,  the  analysis  covers  the  possibility  of 
simultaneous  firings. 




Appendix  III 


Single  Pass  Knot  Detection  Algorithm 


The  environment  under  consideration  is  a  directed  connected  graph.  For  brevity,  we  will 
call  this  Single  Pass  Knot  algorithm  the  SPK  algorithm.  Let  it  be  initiated  by  the  node  S. 
The  SPK  algorithm  is  a  modified  echo  algorithm  in  an  important  sense:  an  edge  which  has 
already  carried  an  echo  may  carry  further  echos. 

We explain the algorithm informally, then describe it in more detail. The algorithm is based on the property that all nodes reachable from the initiator S must also reach S. An explorer stops, as with any echo algorithm, at visited nodes or at sinks, and produces an initial echo. An echo has a status which can be one of four different values: visit, connect, vis-conn, sink. A node acquires one of these status values as it performs echo-merges of the echos it receives. Initially, every node has a status of NULL.

An explorer which comes to S produces an initial echo with status connect. If it comes to a sink node, with no out-edges, the initial echo has status sink. If the explorer comes to an already visited node which is neither of the above, it terminates and produces an initial echo with status visit. Status vis-conn is an intermediate status at a node, describing that one of its explorers went to a visited node, while the other went to a node which we know reaches S. If S receives echos which are all of status connect, then all its reachable nodes reach it, and it is in a knot.
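The property the SPK algorithm tests can be stated as a centralized reference check, which is useful for validating a distributed implementation on small graphs. This is not the SPK algorithm itself: it uses ordinary reachability searches with global knowledge of the graph. The adjacency-list representation and function names are hypothetical, and the requirement that S have at least one successor is an assumption of this sketch.

```python
def reachable(graph, start):
    """Nodes reachable from start along one or more directed edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen


def in_knot(graph, s):
    """True when s has successors and every node it reaches also reaches s."""
    reached = reachable(graph, s)
    return bool(reached) and all(s in reachable(graph, v) for v in reached)
```

For a simple cycle a -> b -> c -> a the check succeeds at a; adding an edge from c to a sink d makes it fail, since d is reachable from a but cannot reach it.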

The  deviation  of  the  SPK  algorithm  from  normal  echo  algorithms  is  that  an  edge  may 
carry  several  echos.  This  occurs  if  a  node  acquires  a  status  which  is  different  from  its 
current  status.  We  will  describe  this  aspect  further.  In  particular,  however,  a  node  keeps  a 
count  of  the  number  of  echos  on  each  of  its  edges,  as  well  as  the  status  associated  with  the 
most  recent  echo  on  the  edge.  Call  this  count  for  an  edge  e,  e.ct.  Initially  it  is  zero  for 
each  edge. 

We  will  first  describe  the  rules  for  echo-merge,  which  we  call  Status  Combination  Rules. 
These  are  called  upon  when  a  node  has  received  an  echo  from  all  its  out-edges.  If  the  result 
of  Status  Combination  is  different  from  the  current  node  status,  the  node  attempts  to  echo. 
The  rules  are  to  be  taken  in  order,  so  (3)  applies  if  (2)  and  (1)  do  not,  etc. 

1. If any echo has status sink, the node gets status sink.

2. If all echos have status visit, then the node gets a status of visit.

3. If any out-edge e has e.ct = 1, and its status is visit or vis-conn, then the node has
status vis-conn.

4. If none of the above, then the node has status of connect.
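The four rules translate directly into an ordered sequence of tests. In this sketch the representation of an out-edge as a (status, ct) pair and the string spellings of the status values are assumptions; the function is meant to be called only once an echo has been received on every out-edge, so the list is non-empty.

```python
def combine_status(out_edges):
    """Apply the Status Combination Rules in order.

    out_edges is a list of (status, ct) pairs, one per out-edge, holding
    the status of the most recent echo on that edge and its count e.ct.
    """
    statuses = [status for status, _ in out_edges]
    if "sink" in statuses:                                # rule 1
        return "sink"
    if all(status == "visit" for status in statuses):     # rule 2
        return "visit"
    if any(ct == 1 and status in ("visit", "vis-conn")    # rule 3
           for status, ct in out_edges):
        return "vis-conn"
    return "connect"                                      # rule 4
```

For example, a node with one out-edge carrying a first echo of status visit and another carrying connect obtains vis-conn under rule 3.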

Figure A.2 Two types of cycles, illustrating the Status Combination Rules

Now we can describe the SPK algorithm. Explorers behave very simply, and echos are complex:


1. An explorer coming to an unvisited node on edge e marks e as the first edge of the
node. It increments e.ct, and sends explorers in parallel on the out-edges of the node.

2. An explorer coming to a visited node on an edge e increments e.ct, and if the node
status is NULL, produces an initial echo with status visit. If it comes to a sink node,
the node acquires a status of sink, and the echo produced has a status of sink. If the
node status is not NULL, the echo status takes the value of the node status.

3. An explorer at the initiator node S produces an initial echo with status connect. In all
cases, the status associated with the edge e is recorded at the node.

4. An echo which arrives at a node on edge e increments e.ct by 1, and records its status
for edge e. It then induces the Status Combination Rules. If the status produced is
the same as the current node status, then nothing more need be done with this echo.

5. A node trying to echo checks if it is S and its status is connect. If so, it is in a knot.

6. A non-initiator node creates an echo on its first edge with status taken from the node,
and sends it. It now considers all its in-edges. An in-edge e which has e.ct > 1 and
has a status different from the node status also gets the echo. In that case, e.ct is
incremented again.

This completes the description of the SPK algorithm. The main reason why this algorithm is so complex is that it has to be able to identify extended sinks, which are sub-graphs that are reachable from nodes reached from S, but have no sinks or edges leading to S or to a node reaching S. These extended sinks may be knots themselves or may contain knots. In any case, the only status which an echo returning on an edge leading into them can have is visit.

However, it is essential that the algorithm distinguish between an edge carrying an echo with status visit from an extended sink, and an edge which carried an explorer to a node which has already been visited, but which may be able to reach S. This is why nodes which acquire a new status must try to echo on edges already echoed. Other conditions arise from the presence of two types of cycles, which are illustrated in Figure A.2. The appreciation of the Status Combination and echo mechanisms in these situations is left to the reader. Figure A.3 illustrates an extended sink, and the intermediate values of status which nodes and echos have during the execution of the SPK algorithm.

It can be seen that collecting all the information necessary to determine whether or not a directed graph is a knot requires a high order of complexity. However, there are only a finite number of status possibilities, and therefore all activity must eventually terminate. Furthermore, because of this, each edge can carry only a finite number of echos, which is at most 3. Therefore, counting one explorer and at most three echos per edge, the maximum number of message passes is 4e, where e is the number of edges in the graph. The time required for the algorithm to complete is at most 3D for all echo activity, and D for explorers to reach the entire graph. Therefore, the total elapsed time is bounded above by 4D, where D is the diameter of the graph.
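
As a worked instance of these bounds, the short Python sketch below evaluates them for a given graph size. The function name `spk_bounds` is hypothetical, and splitting the time bound into D for explorers plus 3D for echo activity is our reading of the argument above, not a formula quoted from the thesis.

```python
from typing import Tuple


def spk_bounds(num_edges: int, diameter: int) -> Tuple[int, int]:
    """Return (max message passes, max elapsed time) for the SPK algorithm.

    Time is measured in units of the longest single-edge traversal time.
    """
    explorers_per_edge = 1      # each edge carries one explorer
    max_echos_per_edge = 3      # and at most three echos
    max_messages = (explorers_per_edge + max_echos_per_edge) * num_edges
    # Assumption: D for explorers to reach the whole graph, 3D for echo activity.
    max_time = diameter + 3 * diameter
    return max_messages, max_time


print(spk_bounds(5, 2))  # (20, 8)
```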
[Figure legend: status VISIT, status CONNECT, status VIS-CONN]



Appendix  IV 


Implementation  Results 


For purposes of demonstration, some of the algorithms were implemented on Thoth [CHER 79], an operating system which uses message passing as the only form of communication between processes, in a single-machine environment. Thus, we were able to run many of the decentralized algorithms in a pseudo-concurrent fashion, relying strictly on messages between processes for communication. In fact, the algorithms below ran on both a Texas Instruments 990/10 and a 990/05.

We implemented the basic extinction election algorithm for a circular configuration, and then the election among k contenders algorithm, also in a circle. To show that non-trivial algorithms are also feasible, we demonstrated the unclocked version of the shortest path tree algorithm, as well as the two pass algorithm for identifying all strongly connected components in a directed graph. It is a testimonial both to the ease of use of Thoth and to the simplicity of these algorithms that the entire implementation project took only seven working days, from learning Thoth and its programming language Zed to completing all the demonstration runs.

The pages that follow contain sample executions of the decentralized algorithms. They have the following structure. First, the configuration of the graph is entered. Nodes are numbered from 0 to n - 1, and are described in sequence. Each node is entered with a list of the neighbours to which it is connected. The programs do not check for consistency of edge connectivities. Next, as a node receives a message, or undergoes some activity, it produces an output message. The interleaved activity of autonomous nodes is simulated in this fashion. Each execution has a diagram which illustrates the activity which took place.

The basic extinction algorithm produces an activity message of the form n:m(x). The interpretation is that at node n, a message from node m is received, and the message has value x, i.e., originates from node x. An m value of 99 represents the start message from the control environment, and in our examples, both nodes 2 and 4 receive a start message from node 99. In other words, the election begins with these two nodes sending their ballots independently. The k-contenders algorithm uses exactly the same activity message format, and the same method of starting.
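
To make the n:m(x) format concrete, the following Python sketch simulates an extinction election on a unidirectional circle and prints activity lines in that format. It is a minimal sketch, not the Zed program run on Thoth. The function name, the FIFO scheduling, and the exact forwarding rule are assumptions: in particular, the rule that a node which has not yet started substitutes its own name when it is larger than the incoming value is inferred from the sample runs.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

START = 99  # pseudo-node standing for the control environment in the sample runs


def extinction_election(succ: Dict[int, int],
                        starters: Iterable[int]) -> Tuple[int, List[str]]:
    """Simulate an extinction election; succ maps each node to its successor.

    Returns the elected node and the activity lines in n:m(x) form.
    """
    best: Dict[int, int] = {}  # largest name seen so far by each active node
    queue = deque((n, START, 0) for n in starters)  # (receiver, sender, value)
    trace: List[str] = []
    while queue:
        n, m, x = queue.popleft()
        trace.append(f"{n}:{m}({x})")
        if m == START:
            if n not in best:                 # start message: send own ballot
                best[n] = n
                queue.append((succ[n], n, n))
        elif x == n:                          # own ballot came all the way round
            return n, trace
        elif n not in best:                   # first contact: join the election
            best[n] = max(n, x)
            queue.append((succ[n], n, best[n]))
        elif x > best[n]:                     # larger ballot: forward it
            best[n] = x
            queue.append((succ[n], n, x))
        # otherwise the smaller ballot is extinguished here
    raise RuntimeError("no node was elected")


winner, lines = extinction_election({0: 1, 1: 4, 2: 0, 3: 2, 4: 3}, [2, 4])
print("\n".join(lines))
print(f"node {winner} elected")
```

With a real scheduler the interleaving of lines can differ from this FIFO ordering, so only the set of messages and the elected node should be compared with a recorded run.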

The pure traversal algorithm uses the output format n:r<m, which means that node n received a message of type r from node m. If r = x, then an explorer was received, while if r = e, then an echo was received.
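
The traversal output can be reproduced with a small simulation. The Python sketch below is an illustration under stated assumptions rather than the program that produced the sample runs: the function name `echo_traversal`, the adjacency-list input, and the FIFO message queue are introduced here, and the graph is assumed to be undirected with symmetric neighbour lists. An explorer reaching an already visited node is answered at once with an echo, as the sample runs suggest.

```python
from collections import deque
from typing import Dict, List, Optional


def echo_traversal(adj: Dict[int, List[int]], initiator: int) -> List[str]:
    """Simulate the pure traversal; return activity lines in 'n: r<m' form."""
    parent: Dict[int, Optional[int]] = {initiator: None}  # first edge of each node
    pending: Dict[int, int] = {}       # echos each node is still waiting for
    queue = deque()                    # messages in flight: (receiver, kind, sender)
    trace: List[str] = []

    def finish(n: int) -> None:
        # Once all explorers sent by n are answered, echo on the first edge.
        if pending[n] == 0 and parent[n] is not None:
            queue.append((parent[n], "e", n))

    def start(n: int) -> None:
        # Send explorers in parallel on every edge except the first edge.
        targets = [v for v in adj[n] if v != parent[n]]
        pending[n] = len(targets)
        queue.extend((v, "x", n) for v in targets)
        finish(n)

    start(initiator)
    while queue:
        n, kind, m = queue.popleft()
        trace.append(f"{n}: {kind}<{m}")
        if kind == "x":
            if n in parent:
                queue.append((m, "e", n))   # already visited: echo immediately
            else:
                parent[n] = m
                start(n)
        else:
            pending[n] -= 1
            finish(n)
    return trace


graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print("\n".join(echo_traversal(graph, 2)))
print("Parallel Traversal Done")
```

Every explorer is answered by exactly one echo, so the number of x lines always equals the number of e lines, and the last line is an echo arriving at the initiator.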

For the shortest path tree algorithm, we have used several message types to show the activity at each node. The form n x(r)<m means that node n received an explorer from node m carrying weight r. The form n x(r)->m means that node n sent an explorer to node m, with weight r. Similarly, the messages n e<m and n e->m have equivalent meanings for echos. Finally, the message n=>m means that the edge (n, m), directed from n to m, is in the shortest path tree.
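
For readers who want to process the shortest path tree output mechanically, the Python sketch below parses one activity line into a record. It is a hypothetical helper, not part of the original programs: the names `SptEvent` and `parse_spt_line` are introduced here, and the accepted spellings (for example `1 x(3) -> 2`, `2 e < 1`, `0 => 3`, and the `0 : end` line that also appears in the sample runs) are taken from the printed executions after whitespace is ignored.

```python
import re
from typing import NamedTuple, Optional


class SptEvent(NamedTuple):
    node: int                  # node n reporting the activity
    kind: str                  # "x" explorer, "e" echo, "tree" edge, or "end"
    direction: Optional[str]   # "recv" for <, "send" for ->, else None
    peer: Optional[int]        # the other node m, if any
    weight: Optional[int]      # weight r carried by an explorer, if any


def parse_spt_line(line: str) -> SptEvent:
    """Parse one activity line of the shortest path tree output."""
    s = re.sub(r"\s+", "", line)          # spacing in the printed runs is irregular
    m = re.fullmatch(r"(\d+)x\((\d+)\)(<|->)(\d+)", s)
    if m:
        direction = "recv" if m[3] == "<" else "send"
        return SptEvent(int(m[1]), "x", direction, int(m[4]), int(m[2]))
    m = re.fullmatch(r"(\d+)e(<|->)(\d+)", s)
    if m:
        direction = "recv" if m[2] == "<" else "send"
        return SptEvent(int(m[1]), "e", direction, int(m[3]), None)
    m = re.fullmatch(r"(\d+)=>(\d+)", s)
    if m:
        return SptEvent(int(m[1]), "tree", None, int(m[2]), None)
    m = re.fullmatch(r"(\d+):end", s)
    if m:
        return SptEvent(int(m[1]), "end", None, None, None)
    raise ValueError(f"unrecognised activity line: {line!r}")


print(parse_spt_line("1 x(3) -> 2"))
```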

In the first execution of the strong components algorithm, we have produced messages which illustrate the decentralized activity at the nodes. For the other examples, only the strong components themselves are produced as output. In the first example, then, the message n:tag<m means that node n received a tag message from node m. Likewise, n:te<m means that node n received, from node m, an echo corresponding to a tag message it sent out. Similarly, n:exp<m represents node n receiving an explorer, in the mark phase, from node m, and n:ee<m represents node n receiving an echo for an explorer, from node m. The message n:START refers to the node which is currently trying to identify the members of its strong component. Finally, the most important message is of the form n(m), which means that node n is a member of the strong component named m.


> / - core
4
Circle with 5 nodes
0 1
1 4
2 0
3 2
4 3
2:99(0)
4:99(0)
0:2(2)
3:4(4)
2:3(4)
1:0(2)
0:2(4)
4:1(2)
1:0(4)
4:1(4)
node 4 elected



> / . cor e
5
Circle with 6 nodes
0 3
1 4
2 0
3 1
4 5
5 2
2:99(0)
4:99(0)
0:2(2)
5:4(4)
3:0(2)
2:5(5)
1:3(3)
0:2(5)
4:1(3)
3:0(5)
1:3(5)
4:1(5)
5:4(5)
node 5 elected


BASIC  EXTINCTION  ELECTION  IN  A  CIRCLE 


Number of nodes?
6
Circle with 6 nodes
0 3
1 4
2 0
3 1
4 5
5 2
Contenders? Enter k followed by names
4 1 3 5 0
1:99(0)
3:99(0)
5:99(0)
0:99(0)
4:1(1)
2:5(5)
1:3(3)
3:0(0)
5:4(1)
0:2(5)
4:1(3)
3:0(5)
5:4(3)
1:3(5)
4:1(5)
5:4(5)
node 5 elected


ELECTION  AMONG  k-CONTENDERS  IN  A  CIRCLE 


Size of graph?

2: e<1
Parallel Traversal Done


Size of graph?
4
Graph with 4 nodes
0 1 2 3
0: up
1 0 2
1: up
2 0 1 3
2: up
3 0 2
3: up
0: x<2
1: x<2
3: x<2
0: x<3
1: x<0
0: x<1
3: x<0
0: e<1
3: e<0
0: e<3
2: e<3
1: e<0
2: e<0
2: e<1
Parallel Traversal Done


PURE  TRAVERSAL  ALGORITHM 


> / . cor e
Size of graph?
4
Graph with 4 nodes
0 1 2 3
0: up
1 2 3
1: up
2 3 0
2: up
3 0 1
3: up
3: x<2
0: x<2
1: x<2
0: x<3
3: x<1
1: x<0
3: x<0
1: x<3
3: e<0
0: x<1
1: e<3
0: e<3
0: e<1
1: e<0
2: e<0
3: e<1
2: e<1
2: e<3
Parallel Traversal Done


PURE  TRAVERSAL  ALGORITHM 


>/ . /basic
Size of graph?
6
Graph with 6 nodes
0 1
0: up
1 0 2 3
1: up
2 1 5
2: up
3 1 5
3: up
4 1 5
4: up
5 2 3 4
1: x<2
0: x<1
3: x<1
4: x<1
5: up
1: e<0
5: x<2
5: x<3
5: x<4
3: x<5
4: x<5
3: e<5
5: e<3
1: e<3
4: e<5
5: e<4
1: e<4
2: e<5
2: e<1
Parallel Traversal Done


PURE  TRAVERSAL  ALGORITHM 


Graph with 4 nodes
PARENT = 179
0 1:2 2:1
0 up
1 0:2 2:1 3:1
1 up
2 0:1 1:1 3:2
2 up
3 1:1 2:2
3 up
1 x(2)<0
2 x(1)<0
1 x(3)->2
2 x(2)->1
1 x(3)->3
1 x(2)<2
2 x(3)->3
1 e->2
3 x(3)<1
2 x(3)<1
3 x(5)->2
2 e->1
3 x(3)<2
2 e<1
3 e->2
2 x(5)<3
2 e->3
1 e<2
2 e<3
2 e->0
3 e<2
0 e<2
3 e->1
1 e<3
1 e->0
0 e<1
0: end
0 => 1
0 => 2
1: end
1 => 3


UNCLOCKED  SHORTEST  PATH  TREE  ALGORITHM 


Size of graph?
4
Graph with 4 nodes
PARENT = 115
0 1:3 2:3 3:1
0 up
1 0:3 3:1
1 up
2 0:3 3:1
2 up
3 0:1 1:1 2:1
3 up
1 x(3)<0
2 x(3)<0
3 x(1)<0
3 x(2)->1
1 x(4)->3
2 x(4)->3
3 x(2)->2
3 x(4)<1
1 x(2)<3
3 e->1
3 x(4)<2
1 x(5)->0
2 x(2)<3
3 e->2
0 x(5)<1
1 e<3
2 x(5)->0
0 e->1
1 e->0
0 e<1
1 e<0
2 e<3
0 x(5)<2
0 e->2
1 e->3
2 e->0
3 e<1
0 e<2
2 e<0
2 e->3
3 e<2
3 e->0
0 e<3
0: end
0 => 3
3: end
3 => 1
3 => 2


UNCLOCKED  SHORTEST  PATH  TREE  ALGORITHM 


Graph with 4

CSRG-103 DECENTRALIZED ALGORITHMS IN DISTRIBUTED SYSTEMS
Ernest J.H. Chang
[Ph.D. Thesis, DCS, July 1979]